Details
- Improvement
- Status: Patch Available
- Major
- Resolution: Unresolved
- 3.6
- None
- None
- Solr 3.6, UniDic 1.3.12 for MeCab (unidic-mecab1312src.tar.gz)
- New
Description
I tried to build a UniDic dictionary for use with Kuromoji on Solr 3.6. I think UniDic is a better dictionary than the IPA dictionary, so Kuromoji for Lucene/Solr should support the UniDic dictionary as standalone Kuromoji does.
My procedure was as follows:
I modified build.xml under the lucene/contrib/analyzers/kuromoji directory and ran 'ant build-dict', which failed with the error below.
build-dict:
[java] dictionary builder
[java]
[java] dictionary format: UNIDIC
[java] input directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/build/contrib/analyzers/kuromoji/unidic-mecab1312src
[java] output directory: /home/kazu/Work/src/solr/brunch_3_6/lucene/contrib/analyzers/kuromoji/src/resources
[java] input encoding: utf-8
[java] normalize entries: false
[java]
[java] building tokeninfo dict...
[java] parse...
[java] sort...
[java] Exception in thread "main" java.lang.AssertionError
[java] encode...
[java] at org.apache.lucene.analysis.ja.util.BinaryDictionaryWriter.put(BinaryDictionaryWriter.java:113)
[java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.buildDictionary(TokenInfoDictionaryBuilder.java:141)
[java] at org.apache.lucene.analysis.ja.util.TokenInfoDictionaryBuilder.build(TokenInfoDictionaryBuilder.java:76)
[java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.build(DictionaryBuilder.java:37)
[java] at org.apache.lucene.analysis.ja.util.DictionaryBuilder.main(DictionaryBuilder.java:82)
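The trace does not say which condition failed at BinaryDictionaryWriter.java:113, so a possible first step is to look at how the UniDic entries are shaped. The following is a minimal sketch, not part of Lucene or Kuromoji: the class name ContextIdScan is hypothetical, and it assumes the extracted UniDic source contains *.csv lexicon files in the usual MeCab layout (surface, left context ID, right context ID, cost, features) encoded as UTF-8. It only reports the largest context IDs and how many entries have differing left and right IDs; whether either matters for the assertion is something to check against the Lucene source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical diagnostic helper; not part of Lucene or Kuromoji.
final class ContextIdScan {

  /** Running totals over all entries scanned so far. */
  static final class Stats {
    int entries;          // lines that parsed as dictionary entries
    int maxLeftId = -1;   // largest left context ID seen
    int maxRightId = -1;  // largest right context ID seen
    int mismatched;       // entries whose left and right IDs differ
  }

  /** Parses one CSV line (surface,leftId,rightId,...) and updates stats. */
  static void scanLine(String line, Stats stats) {
    int start;
    if (line.startsWith("\"")) {
      // Quoted surface form: skip to the closing quote followed by a comma.
      // Limitation: doubled quotes inside the surface are not handled.
      int close = line.indexOf("\",", 1);
      if (close < 0) return;
      start = close + 2;
    } else {
      int comma = line.indexOf(',');
      if (comma < 0) return;
      start = comma + 1;
    }
    String[] fields = line.substring(start).split(",", 3);
    if (fields.length < 2) return;
    int left;
    int right;
    try {
      left = Integer.parseInt(fields[0].trim());
      right = Integer.parseInt(fields[1].trim());
    } catch (NumberFormatException e) {
      return; // not a dictionary entry line
    }
    stats.entries++;
    stats.maxLeftId = Math.max(stats.maxLeftId, left);
    stats.maxRightId = Math.max(stats.maxRightId, right);
    if (left != right) stats.mismatched++;
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: java ContextIdScan.java <dictionary-source-dir>");
      return;
    }
    Stats stats = new Stats();
    // Scan every *.csv file directly under the given directory.
    try (DirectoryStream<Path> files =
             Files.newDirectoryStream(Paths.get(args[0]), "*.csv")) {
      for (Path file : files) {
        try (BufferedReader in =
                 Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
          String line;
          while ((line = in.readLine()) != null) {
            scanLine(line, stats);
          }
        }
      }
    }
    System.out.println("entries:       " + stats.entries);
    System.out.println("max left ID:   " + stats.maxLeftId);
    System.out.println("max right ID:  " + stats.maxRightId);
    System.out.println("left != right: " + stats.mismatched);
  }
}
```

With Java 11 or later this can be run as a single source file, passing the input directory printed in the log above as the only argument. Running the same scan over the mecab-ipadic source gives a baseline to compare against.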
And the diff of build.xml:
===================================================================
--- build.xml (revision 1338023)
+++ build.xml (working copy)
@@ -28,19 +28,31 @@
<property name="maven.dist.dir" location="../../../dist/maven" />
<!-- default configuration: uses mecab-ipadic -->
+ <!--
<property name="ipadic.version" value="mecab-ipadic-2.7.0-20070801" />
<property name="dict.src.file" value="${ipadic.version}.tar.gz" />
<property name="dict.url" value="http://mecab.googlecode.com/files/${dict.src.file}"/>
+ -->
<!-- alternative configuration: uses mecab-naist-jdic
<property name="ipadic.version" value="mecab-naist-jdic-0.6.3b-20111013" />
<property name="dict.src.file" value="${ipadic.version}.tar.gz" />
<property name="dict.url" value="http://sourceforge.jp/frs/redir.php?m=iij&f=/naist-jdic/53500/${dict.src.file}"/>
-->
+
+ <!-- alternative configuration: uses UniDic -->
+ <property name="ipadic.version" value="unidic-mecab1312src" />
+ <property name="dict.src.file" value="unidic-mecab1312src.tar.gz" />
+ <property name="dict.loc.dir" value="/home/kazu/Work/src/nlp/unidic/_archive"/>
+
<property name="dict.src.dir" value="${build.dir}/${ipadic.version}" />
+ <!--
<property name="dict.encoding" value="euc-jp"/>
<property name="dict.format" value="ipadic"/>
+ -->
+ <property name="dict.encoding" value="utf-8"/>
+ <property name="dict.format" value="unidic"/>
+
<property name="dict.normalize" value="false"/>
<property name="dict.target.dir" location="./src/resources"/>
@@ -58,7 +70,8 @@
<target name="compile-core" depends="jar-analyzers-common, common.compile-core" />
<target name="download-dict" unless="dict.available">
- <get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/>
+ <!-- get src="${dict.url}" dest="${build.dir}/${dict.src.file}"/ -->
+ <copy file="${dict.loc.dir}/${dict.src.file}" tofile="${build.dir}/${dict.src.file}"/>
<gunzip src="${build.dir}/${dict.src.file}"/>
<untar src="${build.dir}/${ipadic.version}.tar" dest="${build.dir}"/>
</target>
Attachments
Issue Links
- is related to LUCENE-8816 Decouple Kuromoji's morphological analyser and its dictionary
- Patch Available