Lucene - Core
LUCENE-4956

A Korean analyzer with a Korean morphological analyzer and dictionaries

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.2
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
    • Lucene Fields:
      New

      Description

      The Korean language has specific characteristics, and when developing a search service for Korean with Lucene & Solr there are problems in both searching and indexing. The Korean analyzer solves these problems with a Korean morphological analyzer. It consists of a Korean morphological analyzer, dictionaries, a Korean tokenizer, and a Korean filter, and it is made for Lucene and Solr. If you develop a Korean search service with Lucene, the Korean analyzer is the best choice.
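      As a rough illustration of how those pieces fit together, here is a minimal sketch of a Lucene 4.x analysis chain. KoreanTokenizer and KoreanFilter are the classes contributed by the patch; their exact constructor signatures here are assumptions, not the patch's actual API.

      import java.io.Reader;
      import org.apache.lucene.analysis.Analyzer;
      import org.apache.lucene.analysis.TokenStream;
      import org.apache.lucene.analysis.Tokenizer;

      // Sketch only: KoreanTokenizer and KoreanFilter come from the attached
      // patch, and the constructors shown here are assumptions.
      public final class KoreanAnalyzerSketch extends Analyzer {
          @Override
          protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
              Tokenizer tokenizer = new KoreanTokenizer(reader);   // splits the text into candidate tokens
              TokenStream stream = new KoreanFilter(tokenizer);    // morphological analysis against the dictionaries
              return new TokenStreamComponents(tokenizer, stream);
          }
      }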

      Attachments

      1. eval.patch
        16 kB
        Robert Muir
      2. kr.analyzer.4x.tar
        2.32 MB
        SooMyung Lee
      3. lucene4956.patch
        25 kB
        SooMyung Lee
      4. lucene-4956.patch
        183 kB
        SooMyung Lee
      5. LUCENE-4956.patch
        2.23 MB
        Christian Moen

        Activity

        Dai Deqi added a comment -

        Dear Aaron and All,

        I humbly apologize if I breached an unspoken internet/project protocol.  Such was not my intention.  In fact, I thought I was helping.  In any event, I'll be much more careful in the future.

        Very Respectfully,
        Dai Deqi

        On Saturday, February 8, 2014 6:50 PM, Aaron Madlon-Kay (JIRA) <jira@apache.org> wrote:

            [ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895773#comment-13895773]

        Aaron Madlon-Kay edited comment on LUCENE-4956 at 2/8/14 11:43 PM:
        -------------------------------------------------------------------

        Hello. I am involved with the OmegaT project; I am the "Aaron" referenced in Dai Deqi's quoted emails.

        We are using the version of this tokenizer that is hosted on SourceForge. Our issue should not have been brought up here at all. I apologize for the intrusion.

        I will follow up with SooMyung Lee privately.


        Aaron Madlon-Kay added a comment - edited

        Hello. I am involved with the OmegaT project; I am the "Aaron" referenced in Dai Deqi's quoted emails.

        We are using the version of this tokenizer that is hosted on SourceForge. Our issue should not have been brought up here at all. I apologize for the intrusion.

        I will follow up with SooMyung Lee privately.

        SooMyung Lee added a comment -

        Hi Dai Deqi,
        I created the SourceForge project and contributed it to the Apache Lucene project.
        I'm working on fixing some of the problems mentioned in this JIRA issue.
        The problem you mentioned with "위키백과는" can be solved by adding the word "위키" to the dictionary.
        There is no perfect way to analyze Korean sentences with only an algorithm, the way Porter stemming works for English. So we use both an algorithm and a dictionary to analyze Korean sentences. The dictionary for the Korean analyzer has around 40,000 words; I think most basic Korean words are included in it, but many loan words such as "위키" are not. Users of the Korean analyzer in Korea usually build their own dictionaries for their own purposes.

        Benson Margulies added a comment -

        This is a patch, not an accepted component of Apache Lucene. There's no guarantee that anyone will work on it.

        Dai Deqi added a comment -

        Dear Lucene Korean Team,

        I posted the following at SourceForge too. Thank you for your time. I would appreciate any input or assistance you can provide.

        Respectfully,
        Deqi

        Dear Lucene Korean Team,

        Hi, I'm a translator working with OmegaT and the OmegaT developers (see the Yahoo! OmegaT group). Thank you all very much for the hard work you've put into this analyzer. I was so excited when I came across it!
        As a result, I asked the OmegaT developers if they could include your Korean analyzer in OmegaT, and they did. The unfortunate part is that the analyzer does not appear to be working. See the e-mails pasted below for more information.
        I would respectfully like to ask a few questions. Would you happen to know why this is happening? If there's a problem, do you know if it will be fixed in future releases? Finally, may I ask how this analyzer and the one here are related: https://issues.apache.org/jira/browse/LUCENE-4956
        Thank you all in advance for your time.
        Respectfully,
        Deqi

        Dear Colleagues,
        RE: http://groups.yahoo.com/neo/groups/OmegaT/conversations/messages/20023
        I'm interested in adding a Korean-specific analyzer/tokenizer to OmT 3.0.8 because of the simplicity of the CJK tokenizer described in the RE. To that end, I downloaded KoreanAnalyzer-20100302.jar and, since I'm using a Mac, put it in the .app lib folder and updated the Info.plist file to point to the new jar file.
        Does anyone else know what needs to be done? How do I make OmT aware of the new analyzer and use it by default? I'd be very grateful for any assistance, and apologize in advance if I don't know the difference between an analyzer and a tokenizer.
        For those working in Korean, there's another apparently related analyzer, but I have no idea how to work with it:
        https://issues.apache.org/jira/browse/LUCENE-4956
        V/R,
        Dai Deqi

        Hi Aaron,
        Good news and bad news. I built OmT with the new Korean analyzer that you so graciously added, with no problems at all. However, the new Korean-only analyzer doesn't appear to be working as well as the CJK analyzer. I'm assuming analyzer/tokenizer differences will show up most noticeably in the Glossary pane, and that's where I'm seeing big differences.
        For example, the simple sentence below

        그 전문은 다음과 같다.

        produces TransTips and Glossary hits using the CJK analyzer, but nothing with the new Korean-only analyzer. That was quite disappointing.
        If there are any other tests you or anyone else can suggest or would like me to try, please let me know. I've never done this kind of testing before.
        All the Best,
        Dai Deqi

        Hello.
        I just did a quick test of the KoreanAnalyzer lib and found that while the tokenizer seems to work fine, the analyzer part (which is used for glossary and TransTips, etc.) doesn't seem to work at all.
        Input: "그 전문은 다음과 같다."
        Tokenizer output: [ "그", "전문은", "다음과", "같다" ]
        Analyzer output: [ ]
        In other words, the analyzer simply does not output anything, which means that no matches will be found.
        I'm not sure what to make of this, as we are using the library in the same way as any other Lucene analyzer. This suggests to me that the code is broken; if there's some workaround then perhaps the author of the library can help us, but otherwise we will just have to wait until the standalone library is fixed or a final version is integrated into Lucene.
        -Aaron

        Actually, sorry, I was wrong; the analyzer's output is empty for the example sentence you supplied, but that is not true in the general case.
        For a sentence I took from Wikipedia:
        Input: "위키백과는 전 세계 여러 언어로 만들어 나가는 자유 백과사전으로, 누구나 참여하실 수 있습니다."
        Tokenization: [ "위키백과는", "전", "세계", "여러", "언어로", "만들어", "나가는", "자유", "백과사전으로", "누구나", "참여하실", "수", "있습니다" ]
        Analysis: [ "위키백과는", "위키백", "위키", "키백" ]
        I thought at first this was the result of a very aggressive stopwords filter or something, but the result is the same even when supplying an empty stopwords set. Plus, Google Translate tells me that the analysis result is basically:
        [ "Wikipedia", "Wikipedia", "Wiki", "pedia" ] (all substrings of the first token)
        So it seems the conclusion is the same: the analysis is broken, or at least behaves completely differently from all standard Lucene analyzers.
        -Aaron
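
        For reference, the kind of quick check described in these e-mails can be reproduced with a short token-dumping utility. A minimal sketch, assuming the SourceForge KoreanAnalyzer build under discussion is on the classpath:

        import java.io.StringReader;
        import java.util.ArrayList;
        import java.util.List;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

        // Dump the terms an analysis chain produces for one input string.
        public class TokenDump {
            static List<String> tokens(Analyzer analyzer, String text) throws Exception {
                List<String> result = new ArrayList<String>();
                TokenStream ts = analyzer.tokenStream("f", new StringReader(text));
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    result.add(term.toString());
                }
                ts.end();
                ts.close();
                return result;
            }

            public static void main(String[] args) throws Exception {
                // KoreanAnalyzer is the SourceForge build under discussion (assumed import).
                System.out.println(tokens(new KoreanAnalyzer(), "그 전문은 다음과 같다."));
            }
        }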

        SooMyung Lee added a comment -

        Uwe Schindler, I'm trying to change the code to use StandardTokenizer, but I found a problem: when text containing Chinese characters is passed into the StandardTokenizer, it tokenizes the Chinese characters into individual characters. That makes it difficult to extract index keywords and map Chinese characters to Hangul characters. So, to use StandardTokenizer for KoreanAnalyzer, consecutive Chinese characters should not be split up.
        Can you change the StandardTokenizer as I mentioned?
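
        A minimal sketch of the behavior described above, assuming Lucene 4.2 APIs: StandardTokenizer follows the UAX #29 word-break rules, so each ideograph becomes its own <IDEOGRAPHIC> token, while a run of Hangul stays together as one <HANGUL> token.

        import java.io.StringReader;
        import org.apache.lucene.analysis.standard.StandardTokenizer;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
        import org.apache.lucene.util.Version;

        public class StandardTokenizerCjkDemo {
            public static void main(String[] args) throws Exception {
                // "中國" (Hanja) then "한국" (Hangul): expect 中 and 國 as separate
                // <IDEOGRAPHIC> tokens, but 한국 as a single <HANGUL> token.
                StandardTokenizer tok = new StandardTokenizer(Version.LUCENE_42, new StringReader("中國 한국"));
                CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
                TypeAttribute type = tok.addAttribute(TypeAttribute.class);
                tok.reset();
                while (tok.incrementToken()) {
                    System.out.println(term + "\t" + type.type());
                }
                tok.end();
                tok.close();
            }
        }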

        SooMyung Lee added a comment - edited

        Uwe Schindler Thank you for your comment, Uwe.
        I think I can make some improvements on the problems above, like WordSpaceAnalyzer, posinc/offset, and KoreanTokenizer. I'll upload the patch by this weekend.

        Uwe Schindler added a comment -

        There is one other problem we have to solve: the code ships with a slightly modified copy of a very old version of StandardTokenizer, compiled with the non-JVM-invariant version of JFlex (the one that bakes the JVM's Unicode tables into the generated source code).

        We should use the default StandardTokenizer and modify the filter to use the newly added types.
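
        A minimal sketch of that approach: a filter that branches on the stock StandardTokenizer's token types instead of relying on a forked tokenizer. The class and the per-type handling bodies here are hypothetical placeholders, not the patch's code.

        import java.io.IOException;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

        // Hypothetical sketch: dispatch on the token types emitted by the
        // stock StandardTokenizer instead of shipping a forked copy of it.
        final class TypeDispatchingKoreanFilter extends TokenFilter {
            private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
            private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);

            TypeDispatchingKoreanFilter(TokenStream input) {
                super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
                if (!input.incrementToken()) {
                    return false;
                }
                if ("<HANGUL>".equals(typeAtt.type())) {
                    // run morphological analysis / decompounding on Hangul tokens (not shown)
                } else if ("<IDEOGRAPHIC>".equals(typeAtt.type())) {
                    // map Hanja to Hangul readings here (not shown)
                }
                return true;
            }
        }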

        Uwe Schindler added a comment -

        Hi,
        I have the same problems as Robert with some parts of the code. Parts of the code are impossible to understand, and it looks like some places just have "workarounds" for silly bugs in the original code (like catching ArrayIndexOutOfBoundsException and resuming on a completely different code path).

        The code also creates something like 5 completely new java.util.Collections (Lists, Maps, ...) per token, without even reusing the previous ones!

        The code has lots of problems with offsets and positions (sometimes we worked around them using Math.max(0, positionFromCrazyCode)). The code as it is will not pass TestRandomChains!

        Robert and I have already rewritten lots of the code and also removed the GPL code. At this point it is still not in a state that can be committed to Lucene trunk, let alone backported.
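
        To illustrate the allocation concern, a sketch (not the patch's actual code) of the reuse idiom Lucene filters normally follow: one scratch collection per filter instance, cleared per token rather than reallocated.

        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;
        import org.apache.lucene.analysis.TokenFilter;
        import org.apache.lucene.analysis.TokenStream;

        // Illustrative only: reuse scratch collections across tokens instead
        // of allocating fresh Lists/Maps in every incrementToken() call.
        abstract class ReusingFilter extends TokenFilter {
            // one scratch buffer for the lifetime of the filter...
            private final List<String> scratch = new ArrayList<String>();

            ReusingFilter(TokenStream input) {
                super(input);
            }

            @Override
            public boolean incrementToken() throws IOException {
                if (!input.incrementToken()) {
                    return false;
                }
                scratch.clear(); // ...cleared per token rather than reallocated
                analyze(scratch);
                return true;
            }

            abstract void analyze(List<String> out) throws IOException;
        }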

        SooMyung Lee added a comment -

        Hi Robert,

        Thank you for your effort.

        Yes, I also found some problems in cases like WordSpaceAnalyzer and offsets/posincs, and I made some improvements, so I'll post the patch soon.

        I want to help with the progress, but I feel a bit lost working with you on this project; it is difficult for me to figure out how I can help. Please let me know in more detail what I can do. Would it be helpful if I wrote some test cases?

        Robert Muir added a comment -

        Yes, the nocommits.

        In general there are not enough tests to proceed with fixing more things. I took it as far as I could with TestCoverageHack, but stuff like the AIOOBE-catching is just the tip of the iceberg of the problems in WSOutput/WordSpaceAnalyzer.

        The main challenge here is just that there are many, many special cases in the analysis logic. There need to be good tests for these, rather than just tests that the analysis "does not change", because currently the analysis does really funky things in some situations and needs to change.

        TokenStream logic needs cleanup too: offsets/posincs and so on need to work and BaseTokenStreamTestCase.checkRandomData etc should pass.
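
        A minimal sketch of the kind of test being asked for, using Lucene's test framework (BaseTokenStreamTestCase). The expected tokens here are made-up placeholders, not the analyzer's actual output.

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.BaseTokenStreamTestCase;

        public class TestKoreanAnalyzer extends BaseTokenStreamTestCase {
            public void testBasics() throws Exception {
                Analyzer a = new KoreanAnalyzer(); // the analyzer under discussion (assumed import)
                // assert exact terms for a known input (the expected tokens are placeholders)
                assertAnalyzesTo(a, "위키백과는", new String[] { "위키백과" });
                // hammer the analyzer with random text; catches offset/posinc bugs
                checkRandomData(random(), a, 1000 * RANDOM_MULTIPLIER);
            }
        }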

        Steve Rowe added a comment -

        Robert Muir, Uwe Schindler, Christian Moen: is there anything blocking merging the branch into trunk and branch_4x?

        Robert Muir added a comment -

        Just so anyone reading the thread knows: the clause Benson mentioned is not an advertising clause:

        Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in these Data Files or Software without prior written authorization of the copyright holder.

        The BSD advertising clause reads like this:

        All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed by the <organization>.

        These are very different.

        Benson Margulies added a comment -

        OK, I see, the email thread about Unicode data in general does certainly cover this. Sometimes the workings of Legal are pretty perplexing.

        Robert Muir added a comment -

        > Rob, I got shat on at great length over this for merely test data over at the WS project. I had to make the build pull the data over the network to get certain directors off of my back. I'm trying to spare you the experience. That's all.

        Then perhaps you should push back hard when people don't know what they are talking about, like I do. As I said, the question about using unicode data tables has already been directly answered.

        > As a low-intensity member of the UTC, I would also expect there to be only one license. However, I compare:

        I am also one. This means nothing.

        > They look pretty different to me. Go figure?

        There is only one license, from the terms of use page http://www.unicode.org/copyright.html

        That is what I include. Whoever created your "other license" decided to omit some of the information, which I did not.

        Benson Margulies added a comment -

        Rob, I got shat on at great length over this for merely test data over at the WS project. I had to make the build pull the data over the network to get certain directors off of my back. I'm trying to spare you the experience. That's all.

        As a low-intensity member of the UTC, I would also expect there to be only one license. However, I compare:

        #  Copyright (c) 1991-2011 Unicode, Inc. All Rights reserved.
        #  
        #  This file is provided as-is by Unicode, Inc. (The Unicode Consortium). No
        #  claims are made as to fitness for any particular purpose. No warranties of
        #  any kind are expressed or implied. The recipient agrees to determine
        #  applicability of information provided. If this file has been provided on
        #  magnetic media by Unicode, Inc., the sole remedy for any claim will be
        #  exchange of defective media within 90 days of receipt.
        #  
        #  Unicode, Inc. hereby grants the right to freely use the information
        #  supplied in this file in the creation of products supporting the
        #  Unicode Standard, and to make copies of this file in any form for
        #  internal or external distribution as long as this notice remains
        #  attached.
        

        with

        ! Copyright (c) 1991-2013 Unicode, Inc. 
        ! All rights reserved. 
        ! Distributed under the Terms of Use in http://www.unicode.org/copyright.html.
        !
        ! Permission is hereby granted, free of charge, to any person obtaining a copy 
        ! of the Unicode data files and any associated documentation (the "Data Files") 
        ! or Unicode software and any associated documentation (the "Software") to deal 
        ! in the Data Files or Software without restriction, including without limitation 
        ! the rights to use, copy, modify, merge, publish, distribute, and/or sell copies 
        ! of the Data Files or Software, and to permit persons to whom the Data Files or 
        ! Software are furnished to do so, provided that (a) the above copyright notice(s) 
        ! and this permission notice appear with all copies of the Data Files or Software, 
        ! (b) both the above copyright notice(s) and this permission notice appear in 
        ! associated documentation, and (c) there is clear notice in each modified Data 
        ! File or in the Software as well as in the documentation associated with the Data 
        ! File(s) or Software that the data or software has been modified.
        !
        ! THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, 
        ! EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
        ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO 
        ! EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR 
        ! ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES 
        ! WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF 
        ! CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION 
        ! WITH THE USE OR PERFORMANCE OF THE DATA FILES OR SOFTWARE.
        ! 
        ! Except as contained in this notice, the name of a copyright holder shall not be 
        ! used in advertising or otherwise to promote the sale, use or other dealings in 
        ! these Data Files or Software without prior written authorization of the copyright holder.
        

        They look pretty different to me. Go figure?

        Robert Muir added a comment -

        The exact question about using unicode data tables has been answered explicitly already:

        http://mail-archives.apache.org/mod_mbox/www-legal-discuss/200903.mbox/%3C3d4032300903030415w4831f6e4u65c12881cbb8642c@mail.gmail.com%3E

        I don't think it needs any further discussion.

        Robert Muir added a comment -

        This is the unicode license that all of their data and code comes from. There is only one.

        Please don't waste my time here; if you want to waste the legal team's time, that's OK.

        Benson Margulies added a comment - edited

        That JIRA concerns a different license. The license on the file pointed to there has no advertising clause that I can spot. Which isn't to say that legal would have a problem with this, just that I don't think the JIRA in question tells us.

        Robert Muir added a comment -

        > So that Unicode license is possibly an issue.

        No, it's not. https://issues.apache.org/jira/browse/LEGAL-108

        Benson Margulies added a comment -

        My point is that it might have a bit too much legal notice. Generally, when someone grants a license, the headers all move up to some global NOTICE file, and the file is left with just an Apache license.

        I also noted the following:

        ! Except as contained in this notice, the name of a copyright holder shall not be
        ! used in advertising or otherwise to promote the sale, use or other dealings in
        ! these Data Files or Software without prior written authorization of the copyright holder.

        and then noticed that http://www.apache.org/legal/resolved.html says that it approves of

        • BSD (without advertising clause).

        So that Unicode license is possibly an issue.

        Right now I'm using the git clone, but I just did a pull, and the pathname is lucene/analysis/arirang/src/data/mapHanja.dic

        Robert Muir added a comment -

        Please point to specific files in svn that you have concerns about.

        I recreated this file myself from clearly attributed sources, from scratch.

        It has MORE THAN ENOUGH legal notice.

        Benson Margulies added a comment -

        Looks like mapHanja.dic needs some adjustment of the legal notice? Or was this going to be replaced?

        Robert Muir added a comment -

        Nothing is funny; I renamed them locally, sorry.

        Benson Margulies added a comment - edited

        Something's funny here. On this page (http://www.kristalinfo.com/TestCollections/), the zip file has directories like

        HANTEC-2.0/relevance_file/과학기술분야/
        HANTEC-2.0/relevance_file/전체/

        The first translates as 'Science and Technology' and the second as 'All'.

        The code in the patch expects the word 'full' in the Latin alphabet, no funny full-width characters, in that intermediate directory. So I don't see how a code-page option to unzip got there. I'm suspecting that an 'mv' is in order.

        Benson Margulies added a comment -

        Hmm. When I followed the link, I found a .tar.gz. I guess the zip was further down the page.

        Robert Muir added a comment -

        unzip -O cp949 HANTEC-2.0.zip

        Benson Margulies added a comment -

        Could you share the trick of unpacking the big tarball, locale-wise? I ended up with:

        [benson] /data/HANTEC-2.0 % ls relevance_file
        %B0%FA%C7б%E2%BC%FA%BAо%DF %C0%FCü

        which does not work so well.

        Did you set LOCALE to something before unpacking?

        ASF subversion and git services added a comment -

        Commit 1536244 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1536244 ]

        LUCENE-4956: more cleanups and remove n^2 in filterIncorrect

        ASF subversion and git services added a comment -

        Commit 1536235 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1536235 ]

        LUCENE-4956: improve the runtime of maxWord

        ASF subversion and git services added a comment -

        Commit 1536234 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1536234 ]

        LUCENE-4956: more compound cleanups

        ASF subversion and git services added a comment -

        Commit 1536233 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1536233 ]

        LUCENE-4956: move all the list creation out of compoundnounanalyzer

        ASF subversion and git services added a comment -

        Commit 1536231 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1536231 ]

        LUCENE-4956: more refactoring of decompounding

        ASF subversion and git services added a comment -

        Commit 1536214 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1536214 ]

        LUCENE-4956: more speedup,style,refactoring

        ASF subversion and git services added a comment -

        Commit 1536184 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1536184 ]

        LUCENE-4956: replace some getWord != null with hasWord

        ASF subversion and git services added a comment -

        Commit 1536174 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1536174 ]

        LUCENE-4956: move this out of dictionaryutil

        ASF subversion and git services added a comment -

        Commit 1534514 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534514 ]

        LUCENE-4956: More optimization on captureState

        ASF subversion and git services added a comment -

        Commit 1534477 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534477 ]

        LUCENE-4956: don't captureState unless we have to

        ASF subversion and git services added a comment -

        Commit 1534473 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534473 ]

        LUCENE-4956: pull out broken acronym/etc handling, user can just use classicfilter for that

        ASF subversion and git services added a comment -

        Commit 1534472 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534472 ]

        LUCENE-4956: do this simpler/faster like kuromoji

        ASF subversion and git services added a comment -

        Commit 1534364 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534364 ]

        LUCENE-4956: use a byte1 jamo FST, smaller and much faster

        ASF subversion and git services added a comment -

        Commit 1534141 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534141 ]

        LUCENE-4956: add some cleanups, remove packing, add missing close, lazy-load compound data until you ask for it

        ASF subversion and git services added a comment -

        Commit 1534135 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534135 ]

        LUCENE-4956: Fix file not found case, add close on finally

        ASF subversion and git services added a comment -

        Commit 1534128 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534128 ]

        LUCENE-4956: remove trie

        ASF subversion and git services added a comment -

        Commit 1534115 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534115 ]

        LUCENE-4956: commit working state

        ASF subversion and git services added a comment -

        Commit 1534040 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534040 ]

        LUCENE-4956: don't hold thousands of arrays in dictionary

        ASF subversion and git services added a comment -

        Commit 1534032 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534032 ]

        LUCENE-4956: don't use wordentry for uncompound processing

        ASF subversion and git services added a comment -

        Commit 1534030 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534030 ]

        LUCENE-4956: move dictionary entry classes to dictionary package

        ASF subversion and git services added a comment -

        Commit 1534029 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534029 ]

        LUCENE-4956: more morph cleanups

        ASF subversion and git services added a comment -

        Commit 1534021 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1534021 ]

        LUCENE-4956: clean up compound / feature processing a bit (more coming)

        ASF subversion and git services added a comment -

        Commit 1533923 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533923 ]

        LUCENE-4956: Rewrite iterator-consumer; make "stupid" Exception more selective. I have no idea how to fix!

        ASF subversion and git services added a comment -

        Commit 1533877 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533877 ]

        LUCENE-4956: ban dictionary corrumption

        ASF subversion and git services added a comment -

        Commit 1533872 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533872 ]

        LUCENE-4956: remove slow caseless match in trie, don't read headers as actual entries

        ASF subversion and git services added a comment -

        Commit 1533865 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533865 ]

        LUCENE-4956: fix malformed entries

        ASF subversion and git services added a comment -

        Commit 1533863 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533863 ]

        LUCENE-4956: Remove useless getters in private class

        ASF subversion and git services added a comment -

        Commit 1533862 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533862 ]

        LUCENE-4956: Simplier fix for the broken posIncr. I also cleaned up the Token class and made private to the Filter

        ASF subversion and git services added a comment -

        Commit 1533858 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533858 ]

        LUCENE-4956: Rename IndexWord to Token like in Kuromoji!

        ASF subversion and git services added a comment -

        Commit 1533857 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533857 ]

        LUCENE-4956: Partially fix posIncrAtt to preserve increment of first token. The morphQueue still has a bug, added nocommit!

        ASF subversion and git services added a comment -

        Commit 1533846 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533846 ]

        LUCENE-4956: Fix factory to check for incorrect parameter keys, remove bogus parameters, remove bogus matchVersion on KoreanTokenizer

        ASF subversion and git services added a comment -

        Commit 1533843 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533843 ]

        LUCENE-4956: Make filter final, add one more nocommit

        ASF subversion and git services added a comment -

        Commit 1533842 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533842 ]

        LUCENE-4956: Fix stopwords file, Cleanup analyzer (load stopwords file, no hardcoded stops), and filter (fix broken incrementToken, implement reset), remove unused varaibles in CompoundNounAnalyzer

        ASF subversion and git services added a comment -

        Commit 1533838 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533838 ]

        LUCENE-4956: Move/Rename some files and make pkg-private

        ASF subversion and git services added a comment -

        Commit 1533835 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533835 ]

        LUCENE-4956: more cleanups and visibility fixes

        ASF subversion and git services added a comment -

        Commit 1533821 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533821 ]

        LUCENE-4956: Use IOUtils.decodingReader to load data files

        ASF subversion and git services added a comment -

        Commit 1533817 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533817 ]

        LUCENE-4956: Fix error handling in HanjaUtil to prevent NPE on broken classpath

        ASF subversion and git services added a comment -

        Commit 1533815 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533815 ]

        LUCENE-4956: Fix error handling in HanjaUtil to prevent NPE on broken classpath

        ASF subversion and git services added a comment -

        Commit 1533813 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533813 ]

        LUCENE-4956: refactor syllable handling to not be a list of thousands of arrays

        ASF subversion and git services added a comment -

        Commit 1533781 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533781 ]

        LUCENE-4956: allow use of these with datainput

        ASF subversion and git services added a comment -

        Commit 1533709 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533709 ]

        LUCENE-4956: Remove unused file constants

        ASF subversion and git services added a comment -

        Commit 1533695 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533695 ]

        LUCENE-4956: move data to src/data and setup regeneration (for now simple copy)

        ASF subversion and git services added a comment -

        Commit 1533562 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533562 ]

        LUCENE-4956: remove some dead code

        ASF subversion and git services added a comment -

        Commit 1533557 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533557 ]

        LUCENE-4956: remove empty dir

        ASF subversion and git services added a comment -

        Commit 1533550 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533550 ]

        LUCENE-4956: Tagger is completely dead code! Why did I put work into it?

        ASF subversion and git services added a comment -

        Commit 1533549 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533549 ]

        LUCENE-4956: add TestCoverageHack

        ASF subversion and git services added a comment -

        Commit 1533521 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533521 ]

        LUCENE-4956: Unify Exceptions

        ASF subversion and git services added a comment -

        Commit 1533517 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533517 ]

        LUCENE-4956: Make parser more strict, remove bullshit from data files

        Uwe Schindler added a comment -

        So I would suggest to replace these 3 methods by an FST backing them.

        This one is already gone: Tagger#getGR(prefix)

        So the big dictionary is the thing to do.

        Robert Muir added a comment -

        So I would suggest to replace these 3 methods by an FST backing them.

        I'm starting on this today.

        ASF subversion and git services added a comment -

        Commit 1533403 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533403 ]

        LUCENE-4956: Move empty line removal up

        ASF subversion and git services added a comment -

        Commit 1533382 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533382 ]

        LUCENE-4956: Don't load full files into big List<>, instead process them line by line (the current code uses iterator anyway).

        ASF subversion and git services added a comment -

        Commit 1533378 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533378 ]

        LUCENE-4956: Remove debug output, make map unmodifiable

        ASF subversion and git services added a comment -

        Commit 1533371 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533371 ]

        LUCENE-4956: Quick fix to remove the Trie for the Tagger. The file is very slow and a TreeMap is perfectly fine. Can still be improved, but the primary concern is to remove Trie.java

        ASF subversion and git services added a comment -

        Commit 1533362 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533362 ]

        LUCENE-4956: Make WordEntry components final

        ASF subversion and git services added a comment -

        Commit 1533358 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533358 ]

        LUCENE-4956: Reorder loading

        ASF subversion and git services added a comment -

        Commit 1533355 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533355 ]

        LUCENE-4956: Remove useless synchronization (no lazy loading anymore)

        Uwe Schindler added a comment -

        Hi,
        I reviewed the Trie.java code and its usage yesterday. Trie.java is only used in 2 places, with the same usage pattern:

        • DictionaryUtil#dictionary
        • Tagger#occurences

        In both cases there are only 2 types of matches:

        • DictionaryUtil#findWithPrefix: returns an Iterator of all entries with a given prefix
        • DictionaryUtil#getWord: returns WordEntry for an exact match
        • Tagger#getGR: returns an Iterator of all entries with a given prefix

        These use cases are not really the ones a Trie is made for, so the ideal and most performant solution would be to use Lucene's FST implementation. We would also get an Iterator-like interface to look up prefixes. So I would suggest to replace these 3 methods by an FST backing them. The dictionary would then (like for kuromoji) be preprocessed and saved as a serialized FST in the resource file. The original dictionary as a text file would only be available in the Lucene source distribution, to regenerate the FST.
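
        To make this concrete, here is a minimal sketch of an FST-backed dictionary, assuming the Lucene 4.x-era FST API; the ordinal outputs and the sortedWords list are illustrative, not the actual dictionary code:

        import java.io.IOException;
        import java.util.List;
        import org.apache.lucene.util.BytesRef;
        import org.apache.lucene.util.IntsRef;
        import org.apache.lucene.util.fst.Builder;
        import org.apache.lucene.util.fst.FST;
        import org.apache.lucene.util.fst.PositiveIntOutputs;
        import org.apache.lucene.util.fst.Util;

        class FstDictionarySketch {
            // Build an FST mapping each surface form to an ordinal. Entries MUST be
            // added in sorted (UTF-8 byte) order, so sort the dictionary lines first.
            static FST<Long> build(List<String> sortedWords) throws IOException {
                PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
                Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
                IntsRef scratch = new IntsRef();
                long ord = 0;
                for (String word : sortedWords) {
                    builder.add(Util.toIntsRef(new BytesRef(word), scratch), ord++);
                }
                return builder.finish();
            }

            // Exact match: returns the word's ordinal, or null if it is absent.
            static Long lookup(FST<Long> fst, String word) throws IOException {
                return Util.get(fst, new BytesRef(word));
            }
        }

        The prefix cases (findWithPrefix, getGR) could then be served by seeking a BytesRefFSTEnum to the prefix and stepping forward while the prefix still matches, analogous to how kuromoji consults its dictionary.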

        SooMyung Lee added a comment -

        Robert Muir, Thanks again.

        I have run a test case with new hanja-hangul mapping files. It works very well.

        Robert Muir added a comment -

        OK, see new file here: http://svn.apache.org/viewvc/lucene/dev/branches/lucene4956/lucene/analysis/arirang/src/resources/org/apache/lucene/analysis/ko/dic/mapHanja.dic?revision=1533329&view=markup

        Generated with: http://svn.apache.org/viewvc/lucene/dev/branches/lucene4956/lucene/analysis/arirang/src/tools/java/org/apache/lucene/analysis/ko/GenerateHanjaMap.java?revision=1533329&view=markup

        hanja keys: 27784
        hanja/hangul mappings: 28861

        ASF subversion and git services added a comment -

        Commit 1533329 from Robert Muir in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533329 ]

        LUCENE-4956: generate new mapHanja.dic from sources with clear license

        SooMyung Lee added a comment -

        Great, thanks a lot !

        Robert Muir added a comment -

        Hi, I think we have some good sources with acceptable licenses.

        I will add my processing to tools/ and generate a new mapHanja.dic from sources with clear licenses, and you can then review and help clean up afterwards. I had to first understand precisely how it was being used in analysisChinese, but now I get it.

        I will reply back soon.

        SooMyung Lee added a comment - - edited

        Hi Robert Muir,

        Thank you for your comment. I can reconstitute the hanja-hangul mappings file myself if we cannot find other sources with clear licenses. I can easily get a list of the hanja that appear most often in Korean sentences, and then look them up in online dictionaries. I can start with the 3,000~4,000 most frequent hanja.

        SooMyung Lee added a comment -

        Benson Margulies, WordSpaceAnalyzer has that feature. If Korean morphological analysis fails, KoreanFilter makes WordSpaceAnalyzer try to split the eojeol. I'll add some test cases.

        ASF subversion and git services added a comment -

        Commit 1533293 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533293 ]

        LUCENE-4956: Remove MorphException, it is no longer needed. Fix lots of Exception blocks. Remove unused classes.

        ASF subversion and git services added a comment -

        Commit 1533286 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533286 ]

        LUCENE-4956: Remove thread-unsafe lazy loading. Initialize in static ctor

        ASF subversion and git services added a comment -

        Commit 1533282 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533282 ]

        LUCENE-4956: Remove thread-unsafe lazy loading. Initialize in static ctor

        ASF subversion and git services added a comment -

        Commit 1533278 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533278 ]

        LUCENE-4956: More improvements

        ASF subversion and git services added a comment -

        Commit 1533277 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533277 ]

        LUCENE-4956: Remove lazy dictionary loading, don't convert to string all the time. This may be improved further if we use an array and subtract the smallest codepoint value

        Robert Muir added a comment -

        Maybe we can reconstitute this file from other hanja-hangul mappings with clear licenses?

        I have not done any processing, I will investigate sources such as https://code.google.com/p/google-input-tools/source/browse/src/chrome/os/nacl-hangul/misc/symbol.txt and unihan and see what it looks like.

        Uwe Schindler added a comment -

        "The rest of the files were created by myself during development, except for mapHanja.dic. I added that file two years ago. I'm not sure that it is free of legal problems because much of the data came from other projects' results, so it is better to remove that data."

        Do you have any documentation of who gave the file to you or where you downloaded it? A university? A CD-ROM?

        Uwe Schindler added a comment - - edited

        Hi SooMyung Lee,

        thanks for the clarification. It was not Jack Krupansky who mentioned the GPL violation; it was Robert and me. I am glad that you are aware of this and are trying to clarify it. Indeed, the license of this file is hard to find out, because the Gnutella one (which is the original) has no license header, but the whole Gnutella project is GPL-licensed. Those people also started to donate this code to Google Guava and wanted to relicense it to ASF2, but this is not yet done, so we cannot use this code. The missing license header may be the reason the BlackDuck test was happy.

        Christian Moen offered to donate a PatriciaTrie he wrote himself. Maybe we can replace the Gnutella one with it. I would prefer the solution of not using a trie at all: instead we should use Lucene's FST feature and bundle the whole dictionary as a serialized FST (like kuromoji does).

        About the other copy-pasted code: I already removed all commons-io and commons-lang stuff. Commons-io was completely unneeded, because the resource handling to load resources from JAR files was not very good and can be done much more easily with a simple Class#getResourceAsStream. I already implemented that and moved some classes around, so be sure to update your svn checkout before working more on the module.

        I also removed the \u-escaping from the mapHanja.dic file, so I was able to remove the StringEscapeUtil class, which did too much unescaping (not only \u, but also \n, \t, ...)! We should really check the license of this file, though, or create a new one from Unicode tables. I left the file in SVN (converted to plain UTF-8) for now.

        I am currently working on rewriting some code that creates too many small objects like strings all the time, because this slows down indexing! E.g. HanjaUtils should not use a String just to look up a single char in a map; there are better data structures to hold the mapHanja table.

        We should also not use readLines() to load whole dictionaries onto the heap, then iterate over the result and convert it to something else. We should use a BufferedReader, read line by line, and do the processing directly.
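
        As a concrete illustration of the last two points, here is a minimal sketch assuming UTF-8 data files; DictionaryResources exists in the branch, but the resource argument and the processLine handler are hypothetical:

        import java.io.BufferedReader;
        import java.io.FileNotFoundException;
        import java.io.IOException;
        import java.io.InputStream;
        import java.io.InputStreamReader;

        // Read a classpath resource line by line instead of loading it all into a List.
        static void loadDictionary(String resource) throws IOException {
            InputStream in = DictionaryResources.class.getResourceAsStream(resource);
            if (in == null) {
                throw new FileNotFoundException("dictionary resource not found: " + resource);
            }
            BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (line.isEmpty()) {
                        continue; // skip blank lines while reading, no second pass needed
                    }
                    processLine(line); // hypothetical per-line handler
                }
            } finally {
                reader.close();
            }
        }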

        ASF subversion and git services added a comment -

        Commit 1533264 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1533264 ]

        LUCENE-4956: Remove StringEscapeUtils by unescaping the mapHanja.dic

        Benson Margulies added a comment -

        I am told (I don't read Korean myself) that people often leave out the white space between eojeol that are made up entirely of Hangul letters (Korean letters). Are you just defining these very long things to be single eojeol? Prof Kang in his own work has a module that splits these using some rules.

        SooMyung Lee added a comment -

        Hi, all.

        I'm going to explain how I developed this code, as Christian recommended, because of the license and legal problems that Jack Krupansky mentioned in a previous comment.

        I started to write this code and dictionary in 2006 based on a book by Seung-Shik Kang, who is now a professor at Kookmin University.

        The dictionary consists of several files, but the major files are total.dic, josa.dic, eomi.dic and syllable.dic. In the first step of developing the dictionary, I collected basic stem words for total.dic and particles for josa.dic and eomi.dic from the book and various websites. I then surveyed how the basic stem words are used in online dictionaries. I referred only to the book to make syllable.dic.
        The rest of the files were created by myself during development, except for mapHanja.dic. I added that file two years ago. I'm not sure that it is free of legal problems because much of the data came from other projects' results, so it is better to remove that data.

        To write the source code, I referred to the book, so the major logic was based on it, except for some utility classes such as the String, File and Trie.java helpers. I copied most of the utility classes from the Apache Commons project, but Trie.java came from another website. I cannot remember the exact website now because it was a long time ago, but I remember reading that its license was the Apache License.

        I finished the first version in 2008, created an online community on a website (called Naver) and uploaded the source code. The community currently has over 3,700 members.
        I entered an open-source contest held by a Korean government organization in 2009. During the contest, I uploaded the source code to SourceForge, where it went through a BlackDuck license test and passed.

        I have supported users through the online community (http://cafe.naver.com/korlucene). Some users improved the dictionaries and source code and posted their changes on the website, and I merged and released them again.

        This is the whole process of how I developed the code. If anybody has something to recommend, please let me know.

        SooMyung Lee added a comment - - edited

        Benson Margulies, the Korean Tokenizer has a feature that identifies the language (Korean, English or Chinese) within a Korean sentence. An eojeol in a Korean sentence falls into a few different cases: first, an eojeol may consist of only Korean letters; second, it can be a combination of Korean letters and alphanumeric letters; third, it may consist of only alphanumeric letters; fourth, it may consist of Chinese letters. The tokenizer treats the first and second cases as Korean, so Korean morphological analysis is done in the Korean filter. For the third case, I copied code from the standard filter into the Korean filter. In the fourth case, the Korean filter maps each Chinese letter to its Korean sound and then, if the result is a compound noun, decompounding is done based on the dictionary.
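
        A minimal sketch of the kind of per-character classification behind that case split, using only JDK Unicode blocks (illustrative, not the module's actual detection code):

        // Classify a char roughly into the cases described above.
        static String classify(char c) {
            Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
            if (block == Character.UnicodeBlock.HANGUL_SYLLABLES
                    || block == Character.UnicodeBlock.HANGUL_JAMO
                    || block == Character.UnicodeBlock.HANGUL_COMPATIBILITY_JAMO) {
                return "KOREAN";
            }
            if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                return "CHINESE";
            }
            if (Character.isLetterOrDigit(c)) {
                return "ALPHANUMERIC";
            }
            return "OTHER";
        }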

        Benson Margulies added a comment -

        As a potential user of this technology, I'd like to ask for it to have documentation of its linguistic approach.

        • What is the goal of the tokenizer? Is it to deliver eojeol or hyung-tae-so? If eojeol, does it split up the case where Korean writers are sometimes relaxed about whitespace between them?
        • Similarly, what does it set out to index? Does it index eojeol and then also their contained eumjeol or hyung-tae-so, using position-increment / position-length to indicate compound relationships?

        Robert Muir made changes -
        Attachment eval.patch [ 12608891 ]

        Robert Muir added a comment -

        I did a very quick and dirty evaluation of various analyzers (short queries only) with the HANTEC-2 test collection (http://ir.kaist.ac.kr/anthology/2000.10-%EA%B9%80%EC%A7%80%EC%98%81.pdf)

        I compared 4 different analyzers for index time, size, and mean average precision for the "L2" relevance set:

        • StandardAnalyzer (whitespace on hangul / unigrams on hanja)
        • CJKAnalyzer (bigram technique)
        • KoreanAnalyzer
        • MecabAnalyzer via JNI (https://github.com/bibreen/mecab-ko-lucene-analyzer)

        For each one, I used 3 different ranking strategies: DefaultSimilarity, BM25Similarity, and DFR GL2, no parameter tuning of any sort.

        Analyzer   Index Time   Index Size   MAP(TFIDF)   MAP(BM25)   MAP(GL2)
        Standard   31s          128MB        .0959        .1018       .1028
        CJK        30s          162MB        .1746        .1894       .1910
        Korean     195s         125MB        .2055        .2096       .2058
        Mecab      138s         147MB        .1877        .1960       .1928

        Note that on the first try, I was unable to actually index the entire collection with KoreanAnalyzer, so I had to hack the filter to prevent this:

        Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 4
        	at java.lang.String.substring(String.java:1907)
        	at org.apache.lucene.analysis.ko.KoreanFilter.analysisChinese(KoreanFilter.java:405)
        	at org.apache.lucene.analysis.ko.KoreanFilter.incrementToken(KoreanFilter.java:147)
        	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
        	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:54)
        	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
        

        See the patch for more information (you can also download the data from http://www.kristalinfo.com/TestCollections/ and set some constants and run it yourself).

        Don't read too much into it; this was really quick and dirty and might somehow be biased. For example, there are several charset issues in the test collection... But it looks like the analyzer here is effective.

        ASF subversion and git services added a comment -

        Commit 1532750 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1532750 ]

        LUCENE-4956: Replace StringBuffer by StringBuilder

        ASF subversion and git services added a comment -

        Commit 1532749 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1532749 ]

        LUCENE-4956: Cleanup imports

        ASF subversion and git services added a comment -

        Commit 1532748 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1532748 ]

        LUCENE-4956: Cleanup imports

        ASF subversion and git services added a comment -

        Commit 1532747 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1532747 ]

        LUCENE-4956: Hide ctor of static utility classes

        ASF subversion and git services added a comment -

        Commit 1532739 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1532739 ]

        LUCENE-4956: More obsolete stuff (not even used), some moves to classes where code parts are solely used

        Uwe Schindler added a comment -

        I removed more stuff. Some code was borrowed from commons-lang without attribution, too. We have to review the whole code so we don't violate copyrights or licenses!

        One thing we need to change, too: this code uses the pattern of catching all exceptions and rethrowing them as another type, in this case MorphException. That class should be removed, and all methods should simply declare the exceptions they throw. In particular, we are not allowed to swallow stack traces! The code also sometimes prints to System.out!

        MorphException is crazy altogether: it morphs itself sometimes.
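
        An illustrative before/after of that pattern (hypothetical method names, not the module's actual code):

        // Before: the catch-all swallows the stack trace and the real cause.
        void loadBroken() {
            try {
                readDictionaryFiles();
            } catch (Exception e) {
                throw new MorphException("could not load dictionary"); // cause and trace are lost
            }
        }

        // After: declare the checked exception and let it propagate unchanged.
        void loadFixed() throws IOException {
            readDictionaryFiles();
        }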

        ASF subversion and git services added a comment -

        Commit 1532737 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1532737 ]

        LUCENE-4956: Remove stuff not really needed. TODO: add attribution, because this code is borrowed, too!

        Uwe Schindler added a comment -

        I committed a cleanup of most of the broken and slow resources stuff. It now only uses Class.getResourceAsStream. I also removed code from the FileUtils class (now named DictionaryResources) which was clearly code cloned from somewhere else.

        The resource loading can be further improved:

        • It should not be lazy (it isn't thread-safe); it should load all resources exactly once into a singleton "holder" class, like kuromoji does.
        • We should use WordListLoader and nuke the remaining stuff.
        • There is very inefficient and slow code in some places, reloading the same file over and over again just to do a lookup.

        The code also has legal problems:

        • Trie.java seems to be GPLed (thanks Robert). It seems to be just copied from Gnutella (the name says it all), so it's definitely not Apache-licensed.
        ASF subversion and git services added a comment -

        Commit 1532707 from Uwe Schindler in branch 'dev/branches/lucene4956'
        [ https://svn.apache.org/r1532707 ]

        LUCENE-4956: First step in remove buggy resources stuff:

        • no more properties file
        • Moved files into correct packages
        • Removed KoreanEnv, DictionaryResources is new class

        Still needs more refactoring!

        Robert Muir added a comment -

        Do we need the Tokenizer here at all or just the filter?

        StandardTokenizer now tags runs of hangul text with <HANGUL> and CJK text with <IDEOGRAPHIC> in TypeAttribute; isn't that essentially what is needed here?

        The current tokenizer here just seems to be a clone of an old version of StandardTokenizer.

        The filter needs a reset() at the very least; that seems to be the issue with testRandom.
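
        A minimal sketch of consuming those token types, assuming a Lucene 4.x StandardTokenizer (the Version constant and the surrounding method are illustrative, not the module's code):

        import java.io.IOException;
        import java.io.Reader;
        import org.apache.lucene.analysis.standard.StandardTokenizer;
        import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
        import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
        import org.apache.lucene.util.Version;

        static void dumpTypes(Reader reader) throws IOException {
            StandardTokenizer tokenizer = new StandardTokenizer(Version.LUCENE_44, reader);
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // Hangul runs come back typed "<HANGUL>" and CJK ideographs "<IDEOGRAPHIC>",
                // so a downstream filter could branch on type.type() instead of
                // re-detecting the script itself.
                System.out.println(term.toString() + "\t" + type.type());
            }
            tokenizer.end();
            tokenizer.close();
        }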

        Christian Moen added a comment -

        SooMyung, I've committed the latest changes we merged in Seoul on Monday. It's great if you can fix the decompounding issue we came across, which we disabled a test for.

        Uwe, +1 to use Class#getResourceAsStream and remove FileUtils and JarResources. I'll make these changes and commit to the branch.

        Overall, I think there are a lot of things we can do to improve this code. I would very much like to hear your opinion on what we should fix before committing to trunk, getting this onto the 4.x branch, and improving from there. My thinking is that it might be good to get this committed so we'll have Korean working, even though the code needs some work. SooMyung has a community in Korea that uses it, and it's serving their needs as far as I understand.

        Happy to hear people's opinion on this.

        SooMyung Lee added a comment -

        Uwe Schindler, thank you for your advice.
        This source code has been open on SourceForge since 2009 and has many users, but nobody told me about the bugs, and I didn't know about them either. Christian and I will fix the bugs soon. Thank you again.

        Uwe Schindler added a comment - - edited

        Hi,
        I have seen the same code at a customer and found a big bug in FileUtils and JarResources. We should fix and replace this code completely; it's not platform-independent. We should fix the following (in my opinion) horrible code parts:

        • FileUtils: The code with getProtectionDomain is very crazy... It will never work if the JAR file is not a local file but some other resource. It's also using APIs that are not intended for this use case; getProtectionDomain() is certainly not meant to be used to get the JAR file of the classloader.
        • FileUtils converts the JAR file URL (from getProtectionDomain) to a filesystem path in the wrong way. We should add URL#getPath() to the forbidden APIs; it is almost always a bug! The code should use toURI() and then new File(uri) (see the sketch after this list). The other methods in FileUtils have similar bugs or try to work around them. The whole class must be removed, sorry!
        • JarResources is some crazy caching for resources, and in combination with FileUtils it's just wrong. It also does not scale if you create an uber-jar. The idea of this class is to avoid opening a stream repeatedly, so it loads all resources of the JAR file into memory. This is the wrong way to do it. Please remove this!

        We should remove both classes completely and load resources correctly with Class#getResourceAsStream.
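
        For the URL-to-File point, a small illustrative sketch (the method name is hypothetical; the real fix is to delete FileUtils entirely):

        import java.io.File;
        import java.net.URISyntaxException;
        import java.net.URL;

        static File codeSourceAsFile(Class<?> clazz) throws URISyntaxException {
            URL location = clazz.getProtectionDomain().getCodeSource().getLocation();
            // Buggy: URL#getPath() leaves escapes such as %20 in place, so the
            // path breaks as soon as the install directory contains a space.
            File broken = new File(location.getPath());
            // Correct: go through a URI, which decodes the escapes properly.
            return new File(location.toURI());
        }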

        Christian Moen added a comment -

        SooMyung and I met up in Seoul today, and we've merged his latest changes locally. I'll commit the changes to this branch when I'm back in Tokyo, and SooMyung will follow up with fixing a known issue afterwards. Hopefully we can commit this to trunk very soon.

        Christian Moen added a comment -

        Thanks a lot.

        SooMyung Lee added a comment -

        Christian,

        Yes, my last patch was made against lucene4956. I'll check the problem and let you know how to solve it later today.

        Christian Moen added a comment -

        SooMyung,

        The patch you uploaded on September 11th, was that made against the latest lucene4956 branch?

        The patch doesn't apply cleanly against lucene4956 for me. Could you clarify its origin and tell me how it can be applied? If you can make a patch against the code on lucene4956, that would be much appreciated.

        Thanks!

        Christian Moen added a comment -

        Thanks for pushing me on this. I'll have a look at your recent changes and commit to trunk shortly if everything seems fine. I hope to have this committed to trunk early next week. Sorry this has dragged out.

        SooMyung Lee added a comment -

        Hi Christian,

        I haven't heard any news from you since last August.
        Is there any problem with moving to the next step?

        I run a Korean developers community for the Korean Analyzer.
        I announced that the Arirang analyzer would be incorporated into Lucene and Solr soon,
        so many developers are waiting for it.

        I would like us to move to the next step quickly. If you need any help, please let me know.

        SooMyung Lee made changes -
        Attachment lucene-4956.patch [ 12602525 ]

        SooMyung Lee added a comment -

        Hi Christian,

        I have synced up and made some modifications.
        I'm attaching the patch.

        SooMyung Lee added a comment -

        Hi, Christian
        Thank you for your effort.

        I'll review the changes you made.
        I've also made some improvements; I'll upload the patch soon.

        Christian Moen added a comment -

        SooMyung, let's sync up regarding your latest changes (the patch you attached). I'm thinking perhaps we can merge to trunk first and iterate from there. Thanks.

        Christian Moen added a comment -

        Attaching a patch against trunk (r1513348).

        Show
        Christian Moen added a comment - Attaching a patch against trunk (r1513348).
        Christian Moen made changes -
        Attachment LUCENE-4956.patch [ 12597912 ]
        Christian Moen added a comment -

        I've now aligned the branch with trunk and updated the example schema.xml to use text_ko naming for the Korean field type.

        I've also indexed Korean Wikipedia continuously for a few hours and the JVM heap looks fine.

        There are several additional things that can be done with this code, including generating the parser using JFlex at build time, fixing some of the position issues found by random-blasting, cleanups and dead-code removal, etc. That said, I believe the code we have is useful to Korean users as-is, and I think it's a good idea to integrate it into trunk and iterate further from there.

        Please share your thoughts. Thanks.

        SooMyung Lee added a comment -

        Hi, Christian

        I understand your situation. I know you run a company.

        I was just wondering if there is any problem with integrating it.
        If you need any help, please let me know.

        Christian Moen added a comment -

        Hello SooMyung,

        I'm the one who hasn't followed up properly on this, as I've been too bogged down with other things. I've set aside time next week to work on this, and I hope to have Korean merged and integrated with trunk then. I'm not sure we can make 4.4, but I'm willing to put in extra effort if there's a chance we can get it in in time.

        SooMyung Lee added a comment -

        Hi, Steve

        I see you created the 4.4 branch for the release.
        Looking it over, I found that the Korean analyzer (Arirang) is missing.

        Can you tell me when the Korean analyzer can be incorporated into a release?

        Steve Rowe added a comment -

        I'm in process now, should be done in a little bit.

        Done: committed the 'kr'->'ko' switch at r1486269 on branches/lucene4956/.

        Christian Moen added a comment -

        Thanks a lot!

        Steve Rowe added a comment -

        Hi Christian,

        I'm in process now, should be done in a little bit.

        BTW, I also brought the branch up-to-date with trunk.

        Steve

        Christian Moen added a comment -

        I'm happy to take care of this unless you want to do it, Steve. I can do this either tomorrow or on Friday. Thanks.

        Steve Rowe added a comment -

        I have named the Korean analyzer package "kr", but recently I found that it is incorrect: "kr" is the country code of South Korea and "kp" is the country code of North Korea. I think "ko", the Korean language code, is more suitable for the name of the Korean analyzer package. So, I want you to rename the Korean analyzer package from "kr" to "ko".

        Hi SooMyung, thanks, I'll make the switch (unless Christian beats me to it).

        Walter Underwood added a comment -

        Yes, "ko" is correct. Use country codes for locales, but language codes for stemmers.

        SooMyung Lee added a comment -

        Hi, Christian.

        I have named the Korean analyzer package "kr", but recently I found that it is incorrect: "kr" is the country code of South Korea and "kp" is the country code of North Korea. I think "ko", the Korean language code, is more suitable for the name of the Korean analyzer package. So, I want you to rename the Korean analyzer package from "kr" to "ko".

        SooMyung Lee made changes -
        Attachment lucene4956.patch [ 12583731 ]
        SooMyung Lee added a comment (edited) -

        Hi, Christian.

        I have made some changes to the source code and uploaded them.
        I have changed the source code relating to keyword extraction:
        I removed the properties relating to keyword extraction and changed the keyword extraction logic.
        I've also added a test case that describes how the Korean analyzer works.

        I hope this is of some help to you!

        Christian Moen added a comment -

        I've run KoreanAnalyzer on Korean Wikipedia and also had a look at memory/heap usage. Things look okay overall.

        I believe KoreanFilter uses wrong offsets for synonym tokens, which was discovered by random-blasting. Looking into the issue...
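
        For context, "random-blasting" refers to Lucene's BaseTokenStreamTestCase.checkRandomData(), which feeds random text through an analyzer and asserts that offsets and positions are consistent. A minimal sketch of such a test, assuming the branch's KoreanAnalyzer takes a Version argument like the other analyzers do (adjust to the actual constructor):

        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.BaseTokenStreamTestCase;
        import org.apache.lucene.analysis.kr.KoreanAnalyzer;

        public class TestKoreanAnalyzerRandom extends BaseTokenStreamTestCase {
          public void testRandomStrings() throws Exception {
            // checkRandomData fails on malformed offsets, such as the
            // synonym-token offsets suspected above.
            Analyzer analyzer = new KoreanAnalyzer(TEST_VERSION_CURRENT); // assumed ctor
            checkRandomData(random(), analyzer, 1000 * RANDOM_MULTIPLIER);
          }
        }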

        SooMyung Lee added a comment -

        Hi, Christian.
        Within this week, I'll prepare some test cases and documents that explain how the options work and why they are needed.

        Christian Moen added a comment -

        Hello SooMyung,

        Thanks for the above regarding the field type. The general approach we have taken in Lucene is to do the same analysis on both the index and query side. For example, the Japanese analyzer also has functionality to do compound splitting, and we've discussed doing this on the index side only by default for field type text_ja, but we decided against it.

        I've included your field type in the latest code I've checked in just now, but it's likely that we will change this in the future.

        I'm wondering if you could help me with a few sample sentences that illustrate the various options KoreanFilter has. I'd like to add some test cases for these to better understand the differences between them and to verify correct behaviour. Test cases are also a useful way to document functionality in general. Thanks for any help with this!

        Christian Moen added a comment -

        Thanks, Steve & co.!

        SooMyung Lee added a comment -

        Cool! Thanks, Steve

        Steve Rowe added a comment -

        Yesterday I called a vote for this contribution on general@incubator.apache.org: http://mail-archives.apache.org/mod_mbox/incubator-general/201305.mbox/%3c7AD4D4E3-530B-41E3-8323-DA3D66A40E7E@apache.org%3e

        This vote has passed, so we're now free to incorporate this contribution into the code base when and as we see fit.

        SooMyung Lee added a comment -

        Hi Christian,
        Thanks for your great work.

        I'd like to ask you to modify the text_kr field type definition in schema.xml as follows:

            <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="100">
              <analyzer type="index">
                <tokenizer class="solr.KoreanTokenizerFactory"/>
                <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_kr.txt"/>
              </analyzer>
              <analyzer type="query">
                <tokenizer class="solr.KoreanTokenizerFactory"/>
                <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_kr.txt"/>
              </analyzer>
            </fieldType>
        
        Edward J. Yoon added a comment -

        Great job!

        Christian Moen added a comment -

        Updates:

        • Added text_kr field type to schema.xml
        • Fixed Solr factories to load field type text_kr in the example
        • Updated javadoc so that it compiles cleanly (mostly removed illegal javadoc)
        • Updated various build files to include Korean in the Solr distribution
        • Added placeholder stopwords file
        • Added services for arirang

        Korean analysis using field type text_kr seems to be doing the right thing out-of-the-box now, but some configuration options in the factories aren't working yet. There are several other things that need polishing up, but we're making progress.

        Steve Rowe added a comment -

        Yesterday I called a vote for this contribution on general@incubator.apache.org: http://mail-archives.apache.org/mod_mbox/incubator-general/201305.mbox/%3c7AD4D4E3-530B-41E3-8323-DA3D66A40E7E@apache.org%3e
        Steve Rowe added a comment -

        Hi Jack,

        From http://incubator.apache.org/ip-clearance/ (quoting from that page):

        Intellectual property clearance

        One of the Incubator's roles is to ensure that proper attention is paid to intellectual property. From time to time, an external codebase is brought into the ASF that is not a separate incubating project but still represents a substantial contribution that was not developed within the ASF's source control system and on our public mailing lists. This is a short form of the Incubation checklist, designed to allow code to be imported with alacrity while still providing for oversight.
        [...]
        Once a PMC directly checks-in a filled-out short form, the Incubator PMC will need to approve the paper work after which point the receiving PMC is free to import the code.

        The "short form" referred to above is an XML template, which I've completed for this code base, and which is at some (apparently regular?) interval converted to HTML (this is also linked from the above-linked IP clearance page as "Korean Analyzer"): http://incubator.apache.org/ip-clearance/lucene-korean-analyzer.html

        Robert Muir added a comment -

        Jack, that's correct.

        It is a vote for IP clearance. For example, Simon called an IP clearance vote on the incubator list for Kuromoji before we integrated it into Lucene.

        Jack Krupansky added a comment -

        I am not really familiar with the "incubator-general vote". From looking at the legal clearance page, it sounds like the vote is simply "accepting the donation", as opposed to voting that the branch is ready to commit to trunk, correct?

        I did a Jira search and found no previous references to "incubator-general vote" - from Google search I got the impression it was more related to podlings rather than simple code module contributions.

        Christian Moen added a comment -

        I think we're ready for the incubator-general vote. Christian Moen, do you agree?

        +1

        SooMyung Lee added a comment -

        Christian Moen I'm sorry that I didn't reply to your comment last weekend! I see that Steve Rowe solved your problem. Am I right?
        Steve Rowe I checked the method; isNounPart() is no longer necessary.
        Spaces should be inserted between phrases in a Korean sentence, but many people are confused about where to insert them.

        The isNounPart() method examines whether spaces should be inserted at a specific position only when a noun existing in the dictionary precedes it.
        After testing, I found that the method is superfluous.
        I'm sorry I didn't correct the source code before contributing.

        Commit Tag Bot added a comment -

        [lucene4956 commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1479410

        LUCENE-4956:

        • svn:eol-style -> native
        • tabs -> spaces
        • regularized java code indents to 2 spaces per level
        Steve Rowe added a comment -

        soomyung, I don't understand the following method in WordSpaceAnalyzer.java - what's the point of the method always returning false? (i.e.: if(true) return false;):

        private boolean isNounPart(String str, int jstart) throws MorphException  {
            
          if(true) return false;
            
          for(int i=jstart-1;i>=0;i--) {      
            if(DictionaryUtil.getWordExceptVerb(str.substring(i,jstart+1))!=null)
              return true;
          }
            
          return false;
        }
        

        isNounPart() is only called from one method in the same class: findJosaEnd(snipt,jstart):

        if(DictionaryUtil.existJosa(str) && !findNounWithinStr(snipt,i,i+2) && !isNounPart(snipt,jstart)) {
        
        Steve Rowe added a comment -

        I added license headers to the dictionary files, so AFAICT all files now have Apache License headers.

        I've updated http://incubator.apache.org/ip-clearance/lucene-korean-analyzer.html - it looks ready to go to me. (Again, I can only control the XML version of this, at http://svn.apache.org/repos/asf/incubator/public/trunk/content/ip-clearance/lucene-korean-analyzer.xml, so it might be a day or so before the HTML version catches up.)

        I think we're ready for the incubator-general vote. Christian Moen, do you agree?

        We don't need to wait for the vote result to continue making improvements, e.g. tabs->spaces, svn:eol-style->native, etc. - the vote email will point to the revision on the branch we think is vote-worthy: r1479391.

        Commit Tag Bot added a comment -

        [lucene4956 commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1479391

        LUCENE-4956: Add license headers to dictionary files, and modify FileUtil.readLines() to ignore lines beginning with comment char '!'
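
        For reference, the comment-skipping behaviour described in this commit message might look roughly like the following (a hedged sketch; the actual FileUtil.readLines() signature in the patch may differ):

        import java.io.BufferedReader;
        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.List;

        final class DicReadLinesSketch {
          static List<String> readLines(BufferedReader reader) throws IOException {
            List<String> lines = new ArrayList<String>();
            for (String line = reader.readLine(); line != null; line = reader.readLine()) {
              if (line.startsWith("!")) continue; // '!' marks a license/comment line
              lines.add(line);
            }
            return lines;
          }
        }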

        Commit Tag Bot added a comment -

        [lucene4956 commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1479386

        LUCENE-4956: fix typo
        Steve Rowe added a comment -

        This looks like a typo to me, in KoreanEnv.java - the second FILE_DICTIONARY should instead be FILE_EXTENSION:

        /**
         * Initialize the default property values.
         */
        private void initDefaultProperties() {
          defaults = new Properties();
        	
          defaults.setProperty(FILE_SYLLABLE_FEATURE,"org/apache/lucene/analysis/kr/dic/syllable.dic");
          defaults.setProperty(FILE_DICTIONARY,"org/apache/lucene/analysis/kr/dic/dictionary.dic");
          defaults.setProperty(FILE_DICTIONARY,"org/apache/lucene/analysis/kr/dic/extension.dic");		
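
        For reference, the corrected second line would presumably read:

          defaults.setProperty(FILE_EXTENSION,"org/apache/lucene/analysis/kr/dic/extension.dic");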
        
        Commit Tag Bot added a comment -

        [lucene4956 commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1479362

        LUCENE-4956: Remove o.a.l.analysis.kr.utils.StringUtil and all calls to it (mostly StringUtil.split, replaced with String.split)

        Steve Rowe added a comment (edited) -

        Could you comment about the origins and authorship of org.apache.lucene.analysis.kr.utils.StringUtil in your tar file?
        I'm seeing a lot of authors in this file. Is this from Apache Commons Lang? Thanks!

        I looked at the file content, and it's definitely from Apache Commons Lang (the class is named StringUtils there, renamed StringUtil here), circa early 2010, maybe with a little pulled in from another Commons Lang class.

        I've eliminated StringUtil - it's almost all calls to StringUtils.split(String, separators) - its javadoc is:

        /**
         * <p>Splits the provided text into an array, separators specified.
         * This is an alternative to using StringTokenizer.</p>
         *
         * <p>The separator is not included in the returned String array.
         * Adjacent separators are treated as one separator.
         * For more control over the split use the StrTokenizer class.</p>
         *
         * <p>A <code>null</code> input String returns <code>null</code>.
         * A <code>null</code> separatorChars splits on whitespace.</p>
         *
         * <pre>
         * StringUtil.split(null, *)         = null
         * StringUtil.split("", *)           = []
         * StringUtil.split("abc def", null) = ["abc", "def"]
         * StringUtil.split("abc def", " ")  = ["abc", "def"]
         * StringUtil.split("abc  def", " ") = ["abc", "def"]
         * StringUtil.split("ab:cd:ef", ":") = ["ab", "cd", "ef"]
         * </pre>
         *
         * @param str  the String to parse, may be null
         * @param separatorChars  the characters used as the delimiters,
         *  <code>null</code> splits on whitespace
         * @return an array of parsed Strings, <code>null</code> if null String input
         */
        

        I'm replacing calls to this method with calls to String.split(regex), where regex is "[char]+", and char is the (in all cases singular) split character.
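
        A minimal sketch of the replacement (illustrative class, not from the patch):

        public class SplitExample {
          public static void main(String[] args) {
            // Old: StringUtil.split("ab::cd:ef", ":") -> ["ab", "cd", "ef"]
            // New: the "+" in the regex collapses adjacent separators, matching
            // the Commons Lang behaviour shown in the javadoc above.
            String[] parts = "ab::cd:ef".split("[:]+");
            for (String part : parts) {
              System.out.println(part); // prints ab, cd, ef
            }
            // One caveat: unlike StringUtils.split, String.split keeps a leading
            // empty token when the input starts with a separator, e.g.
            // ":ab".split("[:]+") yields ["", "ab"].
          }
        }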

        I'll commit the changes and the StringUtil.java removal in a little bit once I've got it compiling and the tests succeed.

        Steve Rowe added a comment -

        Thanks, Steve. I've added the missing license header to TestKoreanAnalyzer.java.

        I looked over the rest of the files, and the only things missing license headers are the dictionary files and the korean.properties file, all under src/resources/. I committed a license header to korean.properties.

        I tried adding '#'-commented-out headers to the .dic files (a couple of them already have '######' and '//######' lines), but that triggered a test failure, so more work will be needed before license headers can be included inline in the dictionary files.

        Christian Moen added a comment -

        Good points, Uwe. I'll look into this.

        Uwe Schindler added a comment -

        I have seen that the Tokenizer also uses JFlex, but an older version than is used for Lucene's other tokenizers (like StandardTokenizer). Can we add Ant tasks, like we have for StandardTokenizer, to regenerate the source file from build.xml? Finally, we should regenerate the Java files with the JFlex trunk version and compare them with the ones committed here (if there are differences).

        Christian Moen added a comment -

        Thanks, Steve. I've added the missing license header to TestKoreanAnalyzer.java.

        Steve Rowe added a comment -

        I've created branch lucene4956 and checked in an arirang module in lucene/analysis. I've added a basic test that tests segmentation, offsets, etc.

        Cool!

        License headers have been added to all source code files

        I can see one that doesn't have one: TestKoreanAnalyzer.java. I'll take a pass over all the files.

        Eclipse is TODO.

        I ran ant eclipse and it seemed to do the right thing already - I can see Arirang entries in the .classpath file that gets produced - so I don't think there's anything to be done. I don't use Eclipse, though, so I can't be sure.

        I added Maven config and an IntelliJ Arirang module test run configuration.

        Commit Tag Bot added a comment -

        [lucene4956 commit] sarowe
        http://svn.apache.org/viewvc?view=revision&revision=1479239

        LUCENE-4956: add IntelliJ test run config for Arirang; add Maven config for Arirang

        Christian Moen added a comment -

        I've created branch lucene4956 and checked in an arirang module in lucene/analysis. I've added a basic test that tests segmentation, offsets, etc.

        Other updates:

        • Some compilation warnings related to generics have been fixed, but several remain.
        • License headers have been added to all source code files
        • Author tags have been removed from all files, except StringUtils pending SooMyung's feedback (see above)
        • Added IntelliJ IDEA config to make ant idea set things up correctly. Eclipse is TODO.

        My next step is to fix the remaining compilation warnings altogether, and once we've confirmed StringUtils, I think we can do the incubator-general vote. I'll keep you posted.

        I think we should also consider rewriting and optimising some of the code here and there, but that's for later. It's great if you can be involved in this process, SooMyung! I'll probably need your help and good advice here and there.

        Christian Moen added a comment -

        Hello SooMyung,

        Could you comment about the origins and authorship of org.apache.lucene.analysis.kr.utils.StringUtil in your tar file?

        I'm seeing a lot of authors in this file. Is this from Apache Commons Lang? Thanks!

        Christian Moen made changes -
        Assignee Christian Moen [ cm ]
        Commit Tag Bot added a comment -

        [lucene4956 commit] cm
        http://svn.apache.org/viewvc?view=revision&revision=1479228

        Branch to work on Korean (LUCENE-4956)

        Christian Moen added a comment -

        A quick status update on my side is as follows:

        I've put the code into a module called arirang on my local setup and made a few changes necessary to make things work on trunk. KoreanAnalyzer now produces Korean tokens, and some tests I've written pass when run from my IDE.

        Loading the dictionaries as resources needs some work, and I'll spend time on this during the weekend. I'll also address the headers, etc. to prepare for the incubator-general vote.

        Hopefully, I'll have all this on a branch this weekend. I'll keep you posted and we can take things from there.

        Steve Rowe added a comment -

        Steve, will there be a vote after the code has been checked onto the branch?

        Christian, before the VOTE on incubator-general can be called, the file header and licensing issues need to be completely addressed and vetted by us, working with SooMyung to make sure we get everything right.

        If you think the above is a good next step, I'm happy to start working on this either later this week or next week.

        +1. Thanks for working on this!

        Christian Moen added a comment -

        SooMyung, I don't think you need to do anything at this point. I think a good next step is that we create a new branch and check the code you have submitted onto that branch. We can then start looking into addressing the headers and other items that people have pointed out in comments. (Thanks, Jack and Edward!)

        Steve, will there be a vote after the code has been checked onto the branch? If you think the above is a good next step, I'm happy to start working on this either later this week or next week. Kindly let me know how you prefer to proceed. Thanks.

        SooMyung Lee added a comment (edited) -

        Hi Steve, what should I do in the present situation? Do I need to correct all the issues and submit a new tarball? Please let me know what I have to do to move forward!

        Edward J. Yoon added a comment -

        I think this would be a valuable addition to Apache Lucene (P.S., I'm Korean, as you may know).

        It would be nice if you could remove all the Korean comments, strings, and author tags in the source code to avoid compile and install problems. Otherwise, SVN server/client settings, build-script encoding options, etc. will be somewhat tricky. For example,

        if(entry!=null&&!("을".equals(end)&&entry.getFeature(WordEntry.IDX_REGURA)==IrregularUtil.IRR_TYPE_LIUL)) {
        
        and:
        
        /**
         * 복합명사의 개별단어에 대한 정보를 담고있는 클래스 
         * @author S.M.Lee
         *
         */
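
        (The quoted javadoc reads, roughly, "a class holding information about the individual words of a compound noun.") For Korean string literals that must stay in the code, one common workaround is Unicode escapes, which keep the source files pure ASCII regardless of compiler encoding settings - a hedged illustration:

        public class AsciiSafeLiteral {
          // "을" (the Korean object particle) written as a Unicode escape; this
          // compiles identically no matter what javac -encoding is in effect.
          static final String EUL = "\uC744";
        }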
        
        Jack Krupansky added a comment -

        Looking at the actual tar file, I notice that it has the factory classes placed in "solr" directories rather than in the lucene directories as factories are normally organized.

        By all means proceed with producing a normal patch that shows the final organization of this new analysis package.

        Some other issues:

        1. Complete absence of javadoc for the tokenizer factory and token filter factory classes - it is not "Solr user-ready" at present. There should be an XML example of the token filter with its parameters, as is the usual practice in Lucene/Solr.

        2. No Apache license headers in the "Solr" code. I thought this stuff was already supposed to be ASL 2.0?

        3. No Solr schema.xml change to add the text_ko field type.

        4. At least the KoreanAnalyzer.java and KoreanTokenizer.java source code have tab characters - an odd format. They need to be normalized to Lucene project conventions.

        5. There is a hardwired stop word list in KoreanAnalyzer that appears to be nearly identical to StopAnalyzer.ENGLISH_STOP_WORDS_SET. Why doesn't that static code copy the StopAnalyzer list and then add the few extra terms that are needed (a sketch follows below)? If there is a reason, place it in a comment.

        But as I said, by all means proceed to a normal patch file now that the tar contribution is "legal".
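
        On point 5, the suggested refactoring might look like this (a hedged sketch against the Lucene 4.x API; the extra term is a placeholder, not a real stop word):

        import org.apache.lucene.analysis.core.StopAnalyzer;
        import org.apache.lucene.analysis.util.CharArraySet;
        import org.apache.lucene.util.Version;

        public class KoreanStopWordsSketch {
          static final CharArraySet STOP_WORDS;
          static {
            // Copy the shared English stop set instead of hardcoding a duplicate,
            // then add the few extra terms KoreanAnalyzer needs.
            CharArraySet set = new CharArraySet(Version.LUCENE_44,
                StopAnalyzer.ENGLISH_STOP_WORDS_SET, false /* ignoreCase */);
            set.add("placeholder-extra-term");
            STOP_WORDS = CharArraySet.unmodifiableSet(set);
          }
        }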

        Steve Rowe added a comment -

        As for where the analyzer code itself lives, I think it's fine to put it in lucene/analysis/arirang.

        +1

        I'm all for improving our scheme, but perhaps we can open up a separate JIRA for this and keep this one focused on Korean?

        +1

        Christian Moen added a comment -

        The Korean analyzer should be named org.apache.lucene.analysis.kr.KoreanAnalyzer, and we'll provide a ready-to-use field type text_kr in schema.xml for Solr users, which is consistent with what we do for other languages.

        As for where the analyzer code itself lives, I think it's fine to put it in lucene/analysis/arirang. The file lucene/analysis/README.txt documents what these modules are and the code is easily and directly retrievable in IDEs by looking up KoreanAnalyzer (the source code paths will be set up by ant eclipse and ant idea).

        One reason analyzers have not been put in lucene/analysis/common in the past is that they require dictionaries that are several megabytes in size.

        Overall, I don't think the scheme we are using is all that problematic, but it's true that MorfologikAnalyzer and SmartChineseAnalyzer don't align with it. The scheme doesn't easily lend itself to multiple implementations for one language, but that's not a common case today, although it might become more common in the future.

        In the case of Norwegian (no), there are ISO language codes for both Bokmål (nb) and Nynorsk (nn), and one way of supporting this is to consider these as options to NorwegianAnalyzer, since both languages are Norwegian. See SOLR-4565 for thoughts on how to extend support in NorwegianMinimalStemFilter for this.

        A similar overall approach might make sense when there are multiple implementations of a language; end-users can use an analyzer named <Language>Analyzer without having to study the differences in implementation before using it. I also see problems with this, but it's just a thought...

        I'm all for improving our scheme, but perhaps we can open up a separate JIRA for this and keep this one focused on Korean?

        Walter Underwood added a comment -

        Yes, including the ISO language code in the naming would be a very good idea. You still get into odd situations like Bokmål and Nynorsk, but you are still way ahead.

        Jack Krupansky added a comment -

        The stempel and morfologik analysis modules are both Polish analyzers - if the first one had been named "polish", what would we have done with the second one?

        That's exactly what I was talking about.

        We have four distinct concepts:

        1. Module name.
        2. Package name.
        3. Source tree path.
        4. Module jar name.

        They should incorporate both the language code and the "implementation name" (e.g., "stempel" or "morfologik").

        The module should be something like "analysis/pl/stempel" or "analysis/stempel/pl". I prefer the former - it says that the first priority is to organize by language, and secondarily by implementation.

        And the package name should be something like "org.apache.lucene.analysis.pl.stempel" or "org.apache.lucene.analysis.stempel.pl". I prefer the former, for the same rationale as for module name.

        There seems to be a third form of name, "analyzer-xxx", but as far as I can tell it is only an artifact of the docs or maybe some old Lucene thing.

        And then there are the partial names for the individual jar files. There seem to be both "lucene-analyzers-stempel-x.y.z" and "lucene-analyzers-morfologik-x.y.z" in contrib/lucene-libs, and then multiple "morfologik-a.b.c" jars in contrib/lib.

        In short, to answer your question more directly, in my ideal world we would have source tree and package names like:

        lucene/analysis/pl/stempel/src
        lucene/analysis/pl/morfologik/src
        lucene/analysis/ko/arirang/src

        org.apache.lucene.analysis.pl.stempel
        org.apache.lucene.analysis.pl.morfologik
        org.apache.lucene.analysis.ko.arirang

        This would allow multiple implementations for a single language in the same application.

        Although I could see reversing the language and implementation names if there is some need to share implementation code across languages.

        Steve Rowe added a comment -

        Jack, I think documentation can address most of your concerns. See e.g. the descriptions for the analyzer packages in the API javadocs section of the top-level per-release docs: http://lucene.apache.org/core/4_2_1/index.html. Fortunately, a module's name is not the only opportunity to describe its functionality.

        Even if you just called the module "korean", at least that would be a helpful guide to people like me browsing the list of modules. And then the package name can distinguish the implementations for that language.

        -1. The stempel and morfologik analysis modules are both Polish analyzers - if the first one had been named "polish", what would we have done with the second one?

        Also, it should be possible to mix multiple implementations for the same language in the same application, so, the package name does not to have some unique name, unless there is guaranteed to be only one implementation for that language.

        I agree that mixing same-language implementations should be possible in the same application. I have no idea what you're saying after that. Maybe an example?

        Jack Krupansky added a comment -

        As a user trying to browse and find analyzers and tokenizers for specific languages, I object. I mean, I should be able to look at the language code and guess what module it might be in. It's one thing if the module name is reasonably general and average users could readily associate it with specific languages, or if it categorically groups languages, but giving the module an artificial name that would not be obvious to an average user seems like a poor choice to me.

        Even if you just called the module "korean", at least that would be a helpful guide to people like me browsing the list of modules, and then the package name can distinguish the implementations for that language.

        Also, it should be possible to mix multiple implementations for the same language in the same application, so the package name does not need to have some unique name, unless there is guaranteed to be only one implementation for that language.

        I would suggest that there should be two choices for language-based analysis modules:

        1. A category name, where some general approach covers a number of languages and the implementations need to share classes.
        2. A language code, a hyphen, and some arbitrary name, for implementations that cover only a single language.

        Even for #1, I would suggest a prefix indicating the "type" of languages covered (Eastern European, Asian, etc.).

        That said, I would not stand in the way of adding Korean analysis as soon as possible. I mean, this contribution shouldn't have to correct all of the sins of past contributions.

        Steve Rowe added a comment -

        Hi Steve, I think "arirang" is the best name for the Korean analysis modules. "arirang" is the name of a traditional Korean song. So, I think "arirang" can represent the Korean analysis modules well.

        Thanks SooMyung, "arirang" it is.

        SooMyung Lee added a comment -

        Hi Steve, I think "arirang" is the best name for the Korean analysis modules. "arirang" is the name of a traditional Korean song. So, I think "arirang" can represent the Korean analysis modules well.

        Steve Rowe added a comment -

        I think this donation should be packaged in its own jar, similarly to kuromoji, smartcn, morfologik and stempel, and so should end up at lucene/analysis/korean/.

        soomyung, do you have a good name for the analysis module this will become, rather than "korean"? I'd prefer a name that would allow us to add more Korean analysis modules in the future without having to rename this one.

        The Lucene PMC received notification today that SooMyung's code grant and ICLA paperwork have been received and recorded.

        Christian Moen, now that we have SooMyung's code grant and ICLA recorded, we can start making header modifications. I suggest we create a branch off trunk, create the new module there, check in the files from the tarball attached here, commit, iterate on headers/licensing, and finally hook the new module into the build.
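
        For context, a minimal indexing sketch showing how the donated analyzer might plug in once it becomes a module. The KoreanAnalyzer class and the org.apache.lucene.analysis.kr package follow the attached tarball and may well change after the module is renamed; the no-argument constructor and the Lucene 4.x IndexWriter API are assumptions here.

            import org.apache.lucene.analysis.kr.KoreanAnalyzer; // package as in the tarball; likely to move
            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;
            import org.apache.lucene.document.TextField;
            import org.apache.lucene.index.IndexWriter;
            import org.apache.lucene.index.IndexWriterConfig;
            import org.apache.lucene.store.Directory;
            import org.apache.lucene.store.RAMDirectory;
            import org.apache.lucene.util.Version;

            public class KoreanIndexingSketch {
                public static void main(String[] args) throws Exception {
                    Directory dir = new RAMDirectory(); // in-memory index, just for the demo
                    // no-arg KoreanAnalyzer constructor assumed, per the donated code
                    IndexWriterConfig config =
                        new IndexWriterConfig(Version.LUCENE_42, new KoreanAnalyzer());
                    try (IndexWriter writer = new IndexWriter(dir, config)) {
                        Document doc = new Document();
                        // "Lucene is a search library written in Java."
                        doc.add(new TextField("body", "루씬은 자바로 작성된 검색 라이브러리입니다.", Field.Store.YES));
                        writer.addDocument(doc);
                    }
                }
            }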

        Steve Rowe added a comment -

        The IP clearance form for this donation is here: http://incubator.apache.org/ip-clearance/lucene-korean-analyzer.html. I don't have karma to rebuild the website after I commit changes to the XML source, so there will be delays of a day or so between updates and those updates' appearance on the website.

        Dawid Weiss added a comment -

        That's because Christian has ninja superpowers.
        http://goo.gl/5EPMr

        soomyung added a comment -

        Thanks for your help and your great concern, Christian!

        I visited your website. I noticed that you are not Japanese, yet you developed a Japanese morphological analyzer.

        How is that possible? I'm amazed by your work.

        Christian Moen added a comment -

        Thanks again, SooMyung!

        I see that Steven has informed you about the grant process on the mailing list. I'm happy to help facilitate this process together with Steven.

        Looking forward to getting Korean supported.

        SooMyung Lee made changes -
        Attachment kr.analyzer.4x.tar [ 12580446 ]
        SooMyung Lee added a comment -

        8edffacb15b3964f25054c82c0d4ea92
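
        The bare hash above is presumably the MD5 checksum of the attached kr.analyzer.4x.tar. A quick sketch for verifying a downloaded copy, with the local file name assumed:

            import java.nio.file.Files;
            import java.nio.file.Paths;
            import java.security.MessageDigest;

            public class VerifyTarball {
                public static void main(String[] args) throws Exception {
                    // read the attachment (local path assumed) and hash it with MD5
                    byte[] data = Files.readAllBytes(Paths.get("kr.analyzer.4x.tar"));
                    byte[] digest = MessageDigest.getInstance("MD5").digest(data);
                    StringBuilder hex = new StringBuilder();
                    for (byte b : digest) {
                        hex.append(String.format("%02x", b));
                    }
                    // should print 8edffacb15b3964f25054c82c0d4ea92 if the download is intact
                    System.out.println(hex);
                }
            }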

        SooMyung Lee made changes -
        Field: Description
        Original Value: Korean language has specific characteristic. When developing search service with lucene & solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is best idea to choose the korean analyzer.
        New Value: Korean language has specific characteristic. When developing search service with lucene & solr in korean, there are some problems in searching and indexing. The korean analyer solved the problems with a korean morphological anlyzer. It consists of a korean morphological analyzer, dictionaries, a korean tokenizer and a korean filter. The korean anlyzer is made for lucene and solr. If you develop a search service with lucene in korean, It is the best idea to choose the korean analyzer.
        SooMyung Lee created issue -

          People

          • Assignee: Christian Moen
          • Reporter: SooMyung Lee
          • Votes: 4
          • Watchers: 23
