Lucene - Core
LUCENE-3305

Kuromoji code donation - a new Japanese morphological analyzer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese morphological analyzer to the Apache Software Foundation in the hope that it will be useful to Lucene and Solr users in Japan and elsewhere.

      The project was started in 2010 because we couldn't find any high-quality, actively maintained and easy-to-use Java-based Japanese morphological analyzers, and those qualities became many of our design goals for Kuromoji.

      Kuromoji also has a segmentation mode that is particularly useful for search, which we hope will interest Lucene and Solr users. Compound nouns, such as 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are segmented as one token by most analyzers. As a result, a search for 空港 (airport) or 新聞 (newspaper) will not give you a hit for these words. Kuromoji can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what you want for search, and then you will get a hit.
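
      For illustration, a minimal sketch of search-mode segmentation against the standalone Kuromoji API (the org.atilika.kuromoji package names and the Tokenizer.builder() entry point are assumptions based on the 0.7.x tarballs attached here, not necessarily the final Lucene integration):

          import java.util.List;
          import org.atilika.kuromoji.Token;
          import org.atilika.kuromoji.Tokenizer;

          public class SearchModeDemo {
              public static void main(String[] args) {
                  // SEARCH mode additionally splits long compound nouns into
                  // their components so partial-word queries can match.
                  Tokenizer tokenizer = Tokenizer.builder()
                          .mode(Tokenizer.Mode.SEARCH)
                          .build();

                  // 関西国際空港 should come out as 関西 / 国際 / 空港 here.
                  List<Token> tokens = tokenizer.tokenize("関西国際空港");
                  for (Token token : tokens) {
                      System.out.println(token.getSurfaceForm());
                  }
              }
          }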

      We also wanted to make sure the technology has a license that makes it compatible with other Apache Software Foundation software to maximize its usefulness. Kuromoji is licensed under the Apache License 2.0, and all code is currently owned by Atilika Inc. The software has been developed by my good friend and ex-colleague Masaru Hasegawa and myself.

      Kuromoji uses the so-called IPADIC for its dictionary/statistical model and its license terms are described in NOTICE.txt.

      I'll upload code distributions and their corresponding hashes and I'd very much like to start the code grant process. I'm also happy to provide patches to integrate Kuromoji into the codebase, if you prefer that.

      Please advise on how you'd like me to proceed with this. Thank you.

      1. wordid0.patch
        5 kB
        Christian Moen
      2. LUCENE-3305.patch
        448 kB
        Simon Willnauer
      3. LUCENE-3305.patch
        336 kB
        Robert Muir
      4. kuromoji-solr-0.5.3-asf.tar.gz
        9 kB
        Christian Moen
      5. kuromoji-solr-0.5.3.tar.gz
        9 kB
        Christian Moen
      6. Kuromoji short overview .pdf
        247 kB
        Christian Moen
      7. kuromoji-0.7.6-asf.tar.gz
        141 kB
        Christian Moen
      8. kuromoji-0.7.6.tar.gz
        142 kB
        Christian Moen
      9. ip-clearance-Kuromoji.xml
        6 kB
        Simon Willnauer
      10. ip-clearance-Kuromoji.xml
        6 kB
        Simon Willnauer

        Activity

        Christian Moen added a comment -

        Kuromoji - a Japanese morphological analyzer

        Christian Moen added a comment -

        Kuromoji Solr integration

        Christian Moen added a comment -

        MD5 hashes for the attachments are as follows:

        MD5 (kuromoji-0.7.6.tar.gz) = 70d3d2f69f0511b86ebe11484cbe1313
        MD5 (kuromoji-solr-0.5.3.tar.gz) = b9a54698c9aebc264845e64d3904642d
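
        For anyone verifying the downloads without an md5 command-line tool, a minimal sketch using the standard java.security.MessageDigest API (the file path argument is hypothetical):

            import java.io.InputStream;
            import java.nio.file.Files;
            import java.nio.file.Paths;
            import java.security.MessageDigest;

            public class Md5Check {
                public static void main(String[] args) throws Exception {
                    // Stream a tarball through an MD5 digest, e.g.
                    //   java Md5Check kuromoji-0.7.6.tar.gz
                    MessageDigest md5 = MessageDigest.getInstance("MD5");
                    try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
                        byte[] buf = new byte[8192];
                        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                            md5.update(buf, 0, n);
                        }
                    }
                    StringBuilder hex = new StringBuilder();
                    for (byte b : md5.digest()) {
                        hex.append(String.format("%02x", b));
                    }
                    // Compare against the hashes listed above.
                    System.out.println(hex);
                }
            }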
        
        Christian Moen added a comment - - edited

        Attaching a brief presentation

        Simon Willnauer added a comment -

        WOW this is awesome. It seems we need to file some IP clearance here since this is a substantial contribution not developed in the ASF source control or on the mailing list. I will figure out the process here.

        I looked briefly at the sources here and I think we need to put this into a patch rather than a tar.gz. Some of the files don't have an Apache header, and some of the files state a copyright in the ASL 2 header. Basically, for the code grant you need to put "our" ASL header into each file.
        We also need to apply these sources to our source tree, so it is very likely that this goes under /modules/analysis/common. Can you try to create a patch against trunk? If it is too much of a hassle, you can also move the Solr integration to a different issue.

        thanks simon

        Christian Moen added a comment -

        Thanks a lot, Simon. I wasn't sure when we'd update the headers as part of the process, so thanks for clarifying that, too.

        Kuromoji downloads IPADIC as part of its build (from our server in Japan) to make its data structures, which it bundles into its jar file (the jar becomes 11 MB, but can be made a lot smaller). Building also requires more than the default heap space, so its build is a little convoluted and different from the other code in /modules/analysis/common.

        Kuromoji is also usable independently of search, even though search is perhaps its most important application. Would it be a good idea for me to make a patch that puts it in /modules/analysis/kuromoji for now, and that we take things from there?

        The quickest way to get Kuromoji in there would be to check in the jar file under /modules/analysis/kuromoji/lib, but I'm not sure that's a good way to go.

        I'll follow up in whatever way you prefer. Thanks again!

        Robert Muir added a comment -

        I looked briefly at the sources here and I think we need to put this into a patch rather than a tar.gz. Some of the files don't have an Apache header, and some of the files state a copyright in the ASL 2 header. Basically, for the code grant you need to put "our" ASL header into each file.

        But these things are separate, right? Can't he just fix the license headers and upload a new .tar.gz?

        I don't see anywhere that says a code grant should be a patch; this puts a burden on Christian to do all the work, and our trunk moves too fast. Let's defer creating a patch until the code grant stuff is over... anyone could then turn it into a patch.

        Mark Miller added a comment -

        But these things are separate, right?

        Right - looks like all we need is the ASF copyright in the files. The rest can easily be handled after the grant goes through.

        Christian Moen added a comment -

        Thanks, Robert and Mark.

        I'll upload new tarballs that use the standard ASF license notice in all Java source files. I've also removed author tags to comply better with code standards, and I've removed the Atilika Inc. copyrights from NOTICE.txt in both tarballs.

        Christian Moen added a comment - - edited

        Now uses standard ASF license notice in all Java source files.

        MD5 (kuromoji-0.7.6-asf.tar.gz) = a84f016bd5162e57423a1da181c25f36
        
        Christian Moen added a comment - - edited

        Now uses standard ASF license notice in all Java source files.

        MD5 (kuromoji-solr-0.5.3-asf.tar.gz) = a3e7d5afba64ec0843be6d4dbb95be1c
        
        Uwe Schindler added a comment -

        Code looks cool. I think we should first do the legal stuff and then produce patches. Robert is currently developing another morphological analyzer (Lucene-Gosen, https://code.google.com/p/lucene-gosen/), but that one uses an LGPL library that cannot be included with Lucene/Solr. The Lucene part has lots of cool attributes and additional TokenFilters, so maybe we can combine lucene-gosen with this one (your Apache 2.0 code and his TokenFilters+Attributes)? That would be really cool.

        Christian Moen added a comment -

        Thanks, Uwe!

        I think we definitely should work together and combine the great work that Robert, Koji & co. have been doing on Lucene-GoSen with Kuromoji to make a highly attractive Japanese linguistics offering that is also an integrated part of Lucene/Solr.

        The attributes do indeed look very nice – excellent job! I have several improvements in mind for Kuromoji (and other Japanese related code) and I'm looking forward to working with you to improve some of these things.

        In addition to its license, an issue with GoSen (and Sen) used to be segmentation quality. To my knowledge, these analyzers still don't support so-called "unknown words", which means that words that are not in the dictionaries are treated as second-rate, which negatively impacts segmentation quality.

        Koji Sekiguchi added a comment -

        Hi Christian, it's been a long time. Contribution of Kuromoji to Lucene/Solr sounds really nice! As Uwe already mentioned, lucene-gosen has really good TokenFilters; they are in org.apache packages and under the Apache License. It would be nice if this Japanese tokenizer used them. Plus, lucene-gosen can use not only IPADIC but also NAIST JDIC. I'd like the tokenizer to be able to choose its dictionary in a future release.

        Christian Moen added a comment -

        久しぶりですよね。(It's been a while, hasn't it?) Thanks a lot, Koji.

        I completely agree. If we can get Kuromoji into the codebase, I'm more than happy to submit patches for your filters so that they will work with Kuromoji.

        Kuromoji has preliminary support for UniDic, and it sounds like a good idea to join efforts on this as well. We could support them all: IPADIC, NAIST JDIC and UniDic.

        Christian Moen added a comment -

        Please let me know if you need paperwork from me to follow up on this. Thanks again.

        Simon Willnauer added a comment -

        Hey Christian, I attach the IP-Clearance form for this code donation. What we need to wrap up this process is:

        • a code grant (http://www.apache.org/licenses/software-grant.txt)
        • a CLA from each of you, and a list of all contributors if there are more than the two of you (http://www.apache.org/licenses/icla.txt)
        • a CLA from the company owning the IP (http://www.apache.org/licenses/cla-corporate.txt)

        The CLAs should go to the secretary; I still need to figure out where the code grant needs to go.

        Simon Willnauer added a comment -

        Koji, I took the issue until the code grant is done, etc.

        Christian Moen added a comment -

        Thanks, Simon. Please let me know where I should send the code grant and I'll file the paperwork.

        Christian Moen added a comment -

        Hello again, Simon. Has there been any update as to where I should send the code grant? Many thanks.

        Simon Willnauer added a comment -

        Christian, apparently we just handle this like the CLA. You fill it out, scan it and send it to secretary@apache.org. Make sure you use the ICLA details when you file it.

        Let me know once those are sent.

        Simon Willnauer added a comment -

        I am going to be away for 2 weeks. If somebody wants to continue driving this code grant, please do. Otherwise, @christian, sorry for the break; I will continue once I am back, or here and there if I find a computer.

        simon

        Christian Moen added a comment -

        Hello Simon. I'll file the paperwork over the next couple of days by email and copy you. Have a brilliant vacation!

        Christian Moen added a comment -

        Hello again, Simon. I've filed the paperwork and copied you on email. Hope you're enjoying your vacation!

        Simon Willnauer added a comment -

        Christian, thanks for filing the paperwork. I just called a vote on dev@l.a.o; hope to get this done soon!

        simon

        Simon Willnauer added a comment -

        Christian, I see a couple of files in the resource folders that don't have a license header; we need to make sure that all files have an ASL 2 license header before we can finish the IP clearance process. Also, I don't know much about this segmenter, but I guess it works based on a dictionary, no? If so, where are the dictionary files? I only see resource files in the test folder, but maybe I'm missing something.

        simon

        Christian Moen added a comment -

        Please see NOTICE.txt for information on the dictionaries.

        Kindly let me know which files require a license header and how I should proceed to provide a revised version. Do you prefer a complete tarball, or can I attach the files individually to this JIRA?

        Thanks!

        Simon Willnauer added a comment -

        Please see NOTICE.txt for information on the dictionaries.

        So those dictionaries are not ASL-licensed, right? I need to check with legal whether we can include them in our distribution at all, so we need to figure that out first.

        Christian Moen added a comment -

        Correct. You should definitely check this with legal. I've tried to point this out in the description and in my email with the secretary as well. If there are questions or concerns my legal counsel can possibly assist, but I guess this is something the ASF has to consider by itself.

        Simon Willnauer added a comment -

        FYI - I created an issue on legal to categorize the IPADIC license: LEGAL-97

        Robert Muir added a comment -

        Now that we have some feedback on LEGAL-97, what is the next step we need to do to move forward with this feature?

        Simon Willnauer added a comment -

        According to LEGAL-97 we can include the dict files. That means we can finish this code donation and get everything in shape for a commit. I will finish the paperwork once I am back from traveling.

        Christian Moen added a comment -

        Thanks for the follow-up, Robert and Simon. I've started working on a patch.

        Simon Willnauer added a comment -

        Here is an updated ip-clearance file. Since this is the first time I'm doing this, I would appreciate some feedback or help from others with more experience. Grant, does that look fine to you?

        I think if we are OK with this, we can go ahead and call the vote on incubator.

        Robert Muir added a comment -

        Just a ping... what's our next step?

        Grant Ingersoll added a comment -

        File looks good to me. You need to check in the file to https://svn.apache.org/repos/asf/incubator/public/trunk/site-author/ip-clearance and then call a vote on general@incubator.apache.org (there should be examples of this in the archives for that list). Vote is lazy consensus, so don't expect too much feedback. Once that vote passes, then the code can be committed.

        Simon Willnauer added a comment -

        I committed the file to the incubator ip-clearance in revision 1199470. I will go ahead and call an incubator vote now. Thanks, Grant.

        Simon Willnauer added a comment -

        I sent the vote to general@incubator... we will see in 72h! Thanks, folks!

        Simon Willnauer added a comment -

        Here is an initial patch. Nothing special, just basic integration into the modules/analysis tree. I added a task that downloads the dicts and puts them in place so I could run the tests. All passing for me... still lots of work, but it's a start.

        Robert Muir added a comment -

        Looks like we want to add the Lucene analyzer/tokenizer and Solr factories from kuromoji-solr-0.5.3-asf.tar.gz.

        I'd say once we get things going, maybe just download the dictionary, build it, and when committing, commit the built dictionary under the resources/ folder (this is where the script puts it).

        I think for this kind of feature it might be hard to iterate with patches; we should maybe try to get it into SVN (trunk) initially and iterate with smaller issues. The code looks pretty clean to me already.

        The produced jar file is somewhat large, but I think it's still reasonable, so I think we should look past this for now? Having worked with Sen before, I know some ways we can shrink this a lot, but that would be best left to a future issue.

        Some Java 6 APIs are used here (e.g. Unicode normalization). Christian, can you confirm this is only for the dictionary-build stage? It looked to me like it's only needed for ipadic/unidic parsing, but not for custom dictionary support.

        If it's only for the build stage, personally I think that's fine for 3.x too, because I'm suggesting we commit a 'built' dictionary, and we tell people that if they want to compile the dictionary themselves they need Java 6. We could put the dictionary-building under a tools/ directory that's Java 6-only, or we could depend on ICU for just the tools/ piece (I think we already have such hacks for generating JFlex rules for StandardTokenizer) and be fine on Java 5.

        +1 for the GraphVizFormatter...

        Simon Willnauer added a comment -

        +1 to all your comments. For 3.x, let's figure this out somewhere else... first iterate on trunk, and when we have it at a reasonable stage we backport it to 3.x. The vote succeeded, so we are good to go!

        Simon Willnauer added a comment -

        Marking fix version 4.0 - let's open a new issue for backporting...

        Christian Moen added a comment -

        Thanks a lot, Simon!

        Robert, I agree completely with your comments. The Unicode normalization is only done at dictionary build time. Simon has turned it on by default – its previous default was off. Perhaps it makes sense to have it on in Lucene's case...

        Simon, the TokenizerRunner class doesn't seem to be included in the patch, which might be fine. It's not strictly necessary for Lucene, but I think it's useful to keep it there so the analyzer can easily be run from the command line. The DebugTokenizer and GraphvizFormatter are there already; they aren't strictly necessary either, but they are sometimes quite useful, so I think we should add the TokenizerRunner as well – at least for now.

        Tests didn't pass in my case, but I'll look more into this soon. Tomorrow is very busy for me, but I'll have time for this on Wednesday.

        Robert Muir added a comment -

        I created a branch here (https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3305)
        with an initial import of this code, only minor tweaks to get things working in the build so far.

        Christian Moen added a comment -

        Thanks, Robert.

        I've built the branch. I needed to do ant test -Dargs="-Dfile.encoding=UTF-8" in order to make all the Kuromoji tests pass as some of them assume UTF-8 file encoding. (MacRoman is default on my system.)

        I really appreciate the effort you and Simon have put in. I also hope to make some meaningful contributions to make sure Kuromoji integrates and works well with Solr and Lucene.

        Robert Muir added a comment -

        I've built the branch. I needed to do ant test -Dargs="-Dfile.encoding=UTF-8" in order to make all the Kuromoji tests pass as some of them assume UTF-8 file encoding. (MacRoman is default on my system.)

        This sounds like a bug in the build; you shouldn't have to do that (it should be set already). However, my default encoding is UTF-8, so that's why I didn't catch it. I'll look into this.

        Christian Moen added a comment -

        Patch to fix zero wordid issue. Backport of fix from kuromoji 0.7.7-SNAPSHOT on Github.

        Uwe Schindler added a comment -

        Hi Christian,
        thanks for the fix. I will apply the patch to the branch. The tests testYabottai() and testTsukitosha() are not hurting, but they have no meaning for our variant, because wordid=0 and the last wordid have different words (because we presort the whole dictionary for the FST). To make the test really use wordid=0, I should look up the actual dictionary entries of the first and last word.

        Uwe Schindler added a comment -

        Committed development branch revision: 1229948
        Thanks Christian!

        Robert Muir added a comment -

        Thank you for fixing that bug!

        By the way, I've been reviewing the differences between MeCab and Kuromoji. In general the differences seem fine to me, actually in Kuromoji's favor (at least for search). Most revolve around the middle dot:

        sentence: 私がエドガー・ドガです。
        mecab: [私, が, エドガー・ドガ, です]
        kuromoji: [私, が, エドガー, ドガ, です]
        

        So I think these are improvements, at least for search (e.g. Kuromoji splits the first/last name here).

        But there is often funkiness caused by the normalizeEntries option, which, if an entry is not NFKC-normalized, adds an NFKC-normalized entry with the same costs, etc.

        However, I think in some cases this skews the costs, because e.g. half-width and full-width numbers have different costs. So by adding normalized entries with the full-width cost, we sometimes get worse tokenization.

        sentence: Windows95対応のゲームを動かしたいのです。
        mecab: [Windows, 95, 対応, の, ゲーム, を, 動かし, たい, の, です]
        kuromoji: [Windows, 9, 5, 対応, の, ゲーム, を, 動かし, たい, の, です]
        

        I changed the default of 'normalizeEntries' locally to false and it seemed to totally fix this, and all the differences vs. MeCab then seemed positive.

        I think we should disable normalizeEntries by default so that no costs are potentially skewed... opinions?
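
        For reference, NFKC is the compatibility normalization that folds full-width forms into their half-width equivalents. A minimal sketch with java.text.Normalizer (presumably the Java 6 Unicode normalization API the dictionary-build stage uses, per the earlier comment):

            import java.text.Normalizer;

            public class NfkcDemo {
                public static void main(String[] args) {
                    // Full-width Latin letters and digits, as they can appear
                    // in Japanese text and in raw dictionary entries.
                    String fullWidth = "Ｗｉｎｄｏｗｓ９５";

                    // NFKC folds compatibility characters, e.g. full-width
                    // '９' (U+FF19) becomes ASCII '9'.
                    String normalized = Normalizer.normalize(fullWidth, Normalizer.Form.NFKC);
                    System.out.println(normalized); // Windows95
                }
            }

        This is why a normalized entry generated from a full-width original carries the full-width entry's costs, which is the skew described above.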

        Christian Moen added a comment -

        The middle dot character (nakaguro) is treated as character class SYMBOL in order to provoke a split. This is by design and we override IPADIC in this case since we feel the split behaviour is more reasonable for most applications.

        Having said this, I'd expect input

        私がエドガー・ドガです。
        

        to produce segmentation

        私 が エドガー ・ ドガ です 。
        

        The middle dot ・ seems to have been removed in your case. Are you deliberately removing it somewhere?

        You're right about the NFKC-normalization. It's turned off by default in the Kuromoji on Github. I think disabling this is a reasonable default, but I think it's a good idea to have the option of doing NFKC-normalization prior to segmentation in the Tokenizer/Analyzer (Lucene).

        Robert Muir added a comment -

        The middle dot ・ seems to have been removed in your case. Are you deliberately removing it somewhere?

        Just in my debugging

        (Separately: I did add an option to doTokenize to not emit punctuation tokens, and the Lucene analyzer uses it by default; otherwise index size and searches are affected by many tokens like "。"... but that's unrelated here.)

        You're right about the NFKC-normalization. It's turned off by default in the Kuromoji on Github. I think disabling this is a reasonable default, but I think it's a good idea to have the option of doing NFKC-normalization prior to segmentation in the Tokenizer/Analyzer (Lucene).

        Yeah, I agree; we can add a CharFilter that uses the incremental normalization API.
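
        A hypothetical usage sketch of such a filter, assuming ICU4J's Normalizer2 and the ICUNormalizer2CharFilter that Lucene's icu module later provided (a tokenizer would consume the normalized Reader while offsets stay mapped to the original text):

            import java.io.Reader;
            import java.io.StringReader;
            import com.ibm.icu.text.Normalizer2;
            import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;

            public class NormalizeBeforeTokenizeDemo {
                public static void main(String[] args) throws Exception {
                    // NFKC normalizer; ICU's API supports the incremental
                    // (chunk-by-chunk) normalization a CharFilter needs.
                    Normalizer2 nfkc =
                            Normalizer2.getInstance(null, "nfkc", Normalizer2.Mode.COMPOSE);

                    // Wrap the raw input so downstream consumers see
                    // normalized text.
                    Reader reader = new ICUNormalizer2CharFilter(
                            new StringReader("Ｗｉｎｄｏｗｓ９５対応のゲーム"), nfkc);

                    StringBuilder out = new StringBuilder();
                    char[] buf = new char[64];
                    for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
                        out.append(buf, 0, n);
                    }
                    System.out.println(out); // Windows95対応のゲーム
                }
            }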

        Robert Muir added a comment -

        Updated patch, showing the differences between the branch and trunk.

        I think this is ready to commit to trunk.

        Robert Muir added a comment -

        I committed this to trunk.

        I'll let Hudson chew on it a bit before backporting to branch 3.x, but in general I think we've hammered on this enough that it's ready to be backported too.

        It's a big contribution, so I'm sure minor things might pop up, but we can just open new issues...

        Big thanks to Christian for the contribution... this is awesome!

        Simon Willnauer added a comment -

        I committed this to trunk.

        YAY! thanks everyone!

        Robert Muir added a comment -

        Yes, thanks also to Uwe for lots of work compressing data and refactoring, and to Mike for tuning the FSTs.

        Robert Muir added a comment -

        Backported to 3.x. Thanks again, Christian! Such an awesome addition!

        Christian Moen added a comment -

        Thanks for the excellent work integrating Kuromoji, Robert. Also thanks to everybody who has helped make this happen.


          People

          • Assignee:
            Robert Muir
            Reporter:
            Christian Moen
          • Votes:
            6
            Watchers:
            9
