Dear Lucene Korean Team,
I posted the following at SourceForge too. Thank you for your time. I would appreciate any input or assistance you can provide.
Dear Lucene Korean Team,
Hi, I'm a translator working with OmegaT and the OmegaT developers (see Yahoo! OmegaT group). Thank you all very much for the hard work you've put into this analyzer. I was so excited when I came across it!
As a result, I asked the OmegaT developers if they could include your Korean analyzer in OmegaT, and they did. Unfortunately, the analyzer does not appear to be working. See the e-mails pasted below for more information.
I would respectfully like to ask a few questions. Would you happen to know why this is happening? If there's a problem, do you know if it will be fixed in future releases? Finally, may I ask how this analyzer is related to the one here: https://issues.apache.org/jira/browse/LUCENE-4956
Thank you all in advance for your time.
I'm interested in adding a Korean-specific analyzer/tokenizer to OmT 3.0.8, given the simplicity of the CJK tokenizer described in the RE. To that end, I downloaded KoreanAnalyzer-20100302.jar and, since I'm using a Mac, put it in the .app's lib folder and updated the Info.plist file to point to the new jar file.
Does anyone else know what needs to be done? How do I make OmT aware of the new analyzer and use it by default? I'd be very grateful for any assistance and apologize in advance if I don't know the difference between an analyzer and a tokenizer.
For those working in Korean, there's another apparently related analyzer, but I have no idea how to work with it:
Good news and bad news. I built OmT with the new Korean analyzer that you so graciously added, with no problems at all. However, the new Korean-only analyzer doesn't appear to be working as well as the CJK analyzer. I'm assuming analyzer/tokenizer differences will show up most noticeably in the Glossary pane, and that's where I'm seeing big differences.
For example, the simple sentence below
그 전문은 다음과 같다. ("The full text is as follows.")
produces Transtips and Glossary hits using the CJK analyzer, but nothing with the new Korean-only analyzer. That was quite disappointing.
If there are any other tests you or anyone else can suggest or would like me to try, please let me know. I've never done this kind of testing before.
All the Best,
I just did a quick test of the KoreanAnalyzer lib and found that while the tokenizer seems to work fine, the analyzer part (which is used for glossary and Transtips, etc.) doesn't seem to work at all.
Input: "그 전문은 다음과 같다."
Tokenizer output: [ "그", "전문은", "다음과", "같다" ]
Analyzer output: [ ]
In other words, the analyzer simply does not output anything, which means that no matches will be found.
I'm not sure what to make of this, as we are using the library in the same way as any other Lucene analyzer. This suggests to me that the code is broken. If there's some workaround, then perhaps the author of the library can help us; otherwise, we will just have to wait until the standalone library is fixed or a final version is integrated into Lucene.
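For readers unfamiliar with the distinction being tested here: in Lucene's design, an analyzer is essentially a tokenizer followed by a chain of token filters, so a filter stage that discards every token produces exactly the empty analyzer output reported above even though tokenization itself works. The sketch below models that pipeline in plain Java, with no Lucene dependency; the class and method names are illustrative only, not the KoreanAnalyzer's or Lucene's actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch only: a Lucene-style analyzer is a tokenizer
// followed by a filter chain. These names are hypothetical, not the
// KoreanAnalyzer's real API.
public class AnalyzerSketch {

    // Tokenizer stage: split on whitespace and strip trailing punctuation.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String t = raw.replaceAll("[.,!?]+$", "");
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Analyzer stage: tokenizer output passed through a filter.
    // If the filter rejects everything, the analyzer emits nothing,
    // which is the behavior observed with the KoreanAnalyzer lib.
    static List<String> analyze(String text, Predicate<String> filter) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (filter.test(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "그 전문은 다음과 같다.";
        // Tokenizer works: [그, 전문은, 다음과, 같다]
        System.out.println(tokenize(input));
        // A reject-all filter stage yields the empty analyzer output: []
        System.out.println(analyze(input, t -> false));
    }
}
```

This is only a model of where the breakage could sit in the pipeline; whether the actual fault is in a stopword filter, a stemming stage, or elsewhere in the library is exactly the open question.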
Actually, sorry, I was wrong; the analyzer's output is empty for the example sentence you supplied, but that is not true in the general case.
For a sentence I took from Wikipedia:
Input: "위키백과는 전 세계 여러 언어로 만들어 나가는 자유 백과사전으로, 누구나 참여하실 수 있습니다."
Tokenization: [ "위키백과는", "전", "세계", "여러", "언어로", "만들어", "나가는", "자유", "백과사전으로", "누구나", "참여하실", "수", "있습니다" ]
Analysis: [ "위키백과는", "위키백", "위키", "키백" ]
At first I thought this was the result of a very aggressive stopword filter or something similar, but the result is the same even when supplying an empty stopword set. Plus, Google Translate tells me that the analysis result is basically:
[ "Wikipedia", "Wikipedia", "Wiki", "pedia" ] (all substrings of the first token)
So it seems the conclusion is the same: the analysis is broken, or at least behaves completely differently from all standard Lucene analyzers.