Lucene - Core / LUCENE-3305

Kuromoji code donation - a new Japanese morphological analyzer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese morphological analyzer to the Apache Software Foundation in the hope that it will be useful to Lucene and Solr users in Japan and elsewhere.

      The project was started in 2010 since we couldn't find any high-quality, actively maintained and easy-to-use Java-based Japanese morphological analyzers, and these became many of our design goals for Kuromoji.

      Kuromoji also has a segmentation mode that is particularly useful for search, which we hope will interest Lucene and Solr users. Compound nouns, such as 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are segmented as one token with most analyzers. As a result, a search for 空港 (airport) or 新聞 (newspaper) will not give you a hit for these words. Kuromoji can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what you would want for search, and you'll get a hit.
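
      To illustrate why the segmentation matters for search, here is a minimal, self-contained sketch. It does not use Kuromoji or Lucene: the token lists are written by hand to mimic the two segmentations described above, and the class and method names (CompoundSearchSketch, buildIndex, search) are hypothetical. It only shows that an exact-term lookup in a toy inverted index misses 空港 when the compound is indexed as one token and finds it when the compound is split.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CompoundSearchSketch {

    // Build a toy inverted index: token -> ids of the documents containing it.
    public static Map<String, Set<Integer>> buildIndex(List<List<String>> tokenizedDocs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < tokenizedDocs.size(); docId++) {
            for (String token : tokenizedDocs.get(docId)) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
            }
        }
        return index;
    }

    // Exact-term lookup, as a term query against the toy index would behave.
    public static Set<Integer> search(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        // Hand-written tokens: each compound kept as a single token.
        List<List<String>> wholeCompounds = Arrays.asList(
                Arrays.asList("関西国際空港"),
                Arrays.asList("日本経済新聞"));

        // Hand-written tokens: each compound split into its parts.
        List<List<String>> splitCompounds = Arrays.asList(
                Arrays.asList("関西", "国際", "空港"),
                Arrays.asList("日本", "経済", "新聞"));

        Map<String, Set<Integer>> wholeIndex = buildIndex(wholeCompounds);
        Map<String, Set<Integer>> splitIndex = buildIndex(splitCompounds);

        System.out.println(search(wholeIndex, "空港")); // prints []
        System.out.println(search(splitIndex, "空港")); // prints [0]
        System.out.println(search(splitIndex, "新聞")); // prints [1]
    }
}
```

      In a real index the tokens would come from the analyzer rather than from hand-written lists; the sketch only isolates the effect of the segmentation choice on term matching.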

      We also wanted to make sure the technology has a license that makes it compatible with other Apache Software Foundation software to maximize its usefulness. Kuromoji is licensed under the Apache License 2.0, and all code is currently owned by Atilika Inc. The software has been developed by my good friend and ex-colleague Masaru Hasegawa and me.

      Kuromoji uses the so-called IPADIC for its dictionary/statistical model, and its license terms are described in NOTICE.txt.

      I'll upload code distributions and their corresponding hashes and I'd very much like to start the code grant process. I'm also happy to provide patches to integrate Kuromoji into the codebase, if you prefer that.

      Please advise on how you'd like me to proceed with this. Thank you.

      1. LUCENE-3305.patch (336 kB, Robert Muir)
      2. wordid0.patch (5 kB, Christian Moen)
      3. LUCENE-3305.patch (448 kB, Simon Willnauer)
      4. ip-clearance-Kuromoji.xml (6 kB, Simon Willnauer)
      5. ip-clearance-Kuromoji.xml (6 kB, Simon Willnauer)
      6. kuromoji-solr-0.5.3-asf.tar.gz (9 kB, Christian Moen)
      7. kuromoji-0.7.6-asf.tar.gz (141 kB, Christian Moen)
      8. Kuromoji short overview .pdf (247 kB, Christian Moen)
      9. kuromoji-solr-0.5.3.tar.gz (9 kB, Christian Moen)
      10. kuromoji-0.7.6.tar.gz (142 kB, Christian Moen)

        Activity

        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Fix Version/s 3.6 [ 12319070 ]
        Resolution Fixed [ 1 ]
        Robert Muir made changes -
        Assignee Simon Willnauer [ simonw ] Robert Muir [ rcmuir ]
        Robert Muir made changes -
        Attachment LUCENE-3305.patch [ 12510412 ]
        Christian Moen made changes -
        Attachment wordid0.patch [ 12510169 ]
        Simon Willnauer made changes -
        Fix Version/s 4.0 [ 12314025 ]
        Simon Willnauer made changes -
        Attachment LUCENE-3305.patch [ 12503436 ]
        Simon Willnauer made changes -
        Attachment ip-clearance-Kuromoji.xml [ 12496462 ]
        Simon Willnauer made changes -
        Assignee Koji Sekiguchi [ koji ] Simon Willnauer [ simonw ]
        Simon Willnauer made changes -
        Attachment ip-clearance-Kuromoji.xml [ 12486570 ]
        Koji Sekiguchi made changes -
        Assignee Koji Sekiguchi [ koji ]
        Christian Moen made changes -
        Attachment kuromoji-solr-0.5.3-asf.tar.gz [ 12486183 ]
        Christian Moen made changes -
        Attachment kuromoji-0.7.6-asf.tar.gz [ 12486182 ]
        Christian Moen made changes -
        Description "Kansai International Airports" corrected to "Kansai International Airport" (text otherwise unchanged)
        Christian Moen made changes -
        Attachment Kuromoji short overview .pdf [ 12486155 ]
        Christian Moen made changes -
        Attachment kuromoji-solr-0.5.3.tar.gz [ 12486154 ]
        Christian Moen made changes -
        Field Original Value New Value
        Attachment kuromoji-0.7.6.tar.gz [ 12486153 ]
        Christian Moen created issue -

          People

          • Assignee:
            Robert Muir
            Reporter:
            Christian Moen
          • Votes:
            6
            Watchers:
            9
