Lucene - Core / LUCENE-4056

Japanese Tokenizer (Kuromoji) cannot build UniDic dictionary

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.6
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels: None
    • Environment:
      Solr 3.6
      UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
    • Lucene Fields: New

      Description

      I tried to build a UniDic dictionary to use with Kuromoji on Solr 3.6. I think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should support the UniDic dictionary, as standalone Kuromoji does.

      The following is my procedure:
      I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict', which failed with the error below.

      build-dict:
      [java] dictionary builder
      [java]
      [java] dictionary format: UNIDIC
      [java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
      [java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
      [java] input encoding: utf-8
      [java] normalize entries: false
      [java]
      [java] building tokeninfo dict...
      [java] parse...
      [java] sort...
      [java] Exception in thread "main" java.lang.AssertionError
      [java] encode...
      [java] at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
      [java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
      [java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
      [java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
      [java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
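
The trace above carries no assertion message, so it does not say which check in BinaryDictionaryWriter.put failed. One way to narrow it down might be to summarize the numeric columns of the dictionary source before building. The sketch below is a hypothetical helper, not part of Lucene. It assumes the MeCab CSV layout (surface form, left context ID, right context ID, word cost, then features). The values it reports (largest context IDs, entries whose left and right IDs differ) are only guesses at what the builder might be sensitive to.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper (not part of Lucene) for MeCab-format CSV entries. */
final class ContextIdScan {

    /** Largest context IDs seen, plus how many entries have leftId != rightId. */
    static final class Summary {
        int entries = 0;
        int maxLeftId = -1;
        int maxRightId = -1;
        int mismatched = 0;
    }

    /**
     * Splits one CSV line on commas, treating commas inside double quotes as data.
     * Quote characters themselves are dropped, which is enough for this scan.
     */
    static List<String> splitCsv(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                quoted = !quoted;
            } else if (c == ',' && !quoted) {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }

    /** Scans entries; assumes columns 1 and 2 hold the left and right context IDs. */
    static Summary scan(Iterable<String> lines) {
        Summary s = new Summary();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            List<String> f = splitCsv(line);
            if (f.size() < 4) {
                continue; // not a dictionary entry in the assumed layout
            }
            int left = Integer.parseInt(f.get(1).trim());
            int right = Integer.parseInt(f.get(2).trim());
            s.entries++;
            s.maxLeftId = Math.max(s.maxLeftId, left);
            s.maxRightId = Math.max(s.maxRightId, right);
            if (left != right) {
                s.mismatched++;
            }
        }
        return s;
    }
}
```

Feeding it the lines of the extracted dictionary CSV files (read as UTF-8, matching the "input encoding" shown in the log) and comparing the summary with the same scan over mecab-ipadic would show whether the two sources differ in these columns.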

      And the diff of build.xml:

      ===================================================================
      --- build.xml (revision 1338023)
      +++ build.xml (working copy)
      @@ -28,19 +28,31 @@
        <property name="maven.dist.dir" location="../../../dist/maven" />

        <!-- default configuration: uses mecab-ipadic -->
      + <!--
        <property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
        <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
        <property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
      + -->

        <!-- alternative configuration: uses mecab-naist-jdic
        <property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
        <property name="dict.src.file" value="${ipadic.version}.tar.gz" />
        <property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&f=/naist-jdic/53500/${dict.src.file}"/>
        -->
      -
      +
      + <!-- alternative configuration: uses UniDic -->
      + <property name="ipadic.version" value="unidic-mecab1312src" />
      + <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
      + <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
      +
        <property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
      + <!--
        <property name="dict.encoding" value="euc-jp"/>
        <property name="dict.format" value="ipadic"/>
      + -->
      + <property name="dict.encoding" value="utf-8"/>
      + <property name="dict.format" value="unidic"/>
      +
        <property name="dict.normalize" value="false"/>
        <property name="dict.target.dir" location="./src/resources"/>

      @@ -58,7 +70,8 @@

        <target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
        <target name="download-dict" unless="dict.available">
      - <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
      + <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
      + <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
        <gunzip src="${build.dir}/${dict.src.file}"/>
        <untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
        </target>

    Attachments

    1. LUCENE-4056.patch (11 kB, Jun Ohtani)

    People

    • Assignee: Unassigned
    • Reporter: Kazuaki Hiraga (h.kazuaki)
    • Votes: 0
    • Watchers: 5

    Time Tracking

    • Original Estimate: Not Specified
    • Remaining Estimate: 0h
    • Time Spent: 1h 40m