Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-1728

Move SmartChineseAnalyzer & resources to own contrib project

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 2.9
    • modules/analysis
    • None
    • New

    Description

      SmartChineseAnalyzer depends on a large dictionary that causes the analyzer jar to grow up to 3MB. The dictionary is quite big compared to all the other resouces / class files contained in that jar.
      Having a separate analyzer-cn contrib project enables footprint-sensitive users (e.g. using lucene on a mobile phone) to include analyzer.jar without getting into trouble with disk space.

      Moving SmartChineseAnalyzer to a separate project could also include a small refactoring as Robert mentioned in LUCENE-1722 several classes should be package protected, members and classes could be final, commented syserr and logging code should be removed etc.

      I set this issue target to 2.9 - if we can not make it until then feel free to move it to 3.0

      Attachments

        1. LUCENE-1728.txt
          43 kB
          Robert Muir
        2. LUCENE-1728.txt
          43 kB
          Robert Muir
        3. LUCENE-1728.txt
          43 kB
          Robert Muir
        4. LUCENE-1728.txt
          53 kB
          Robert Muir

        Activity

          rcmuir Robert Muir added a comment -

          Simon, what do you think about name for the new folder?

          I am concerned users will be confused between analyzers/cjk, analyzers/cn and analyzer-cn, all of which are different.
          Should we name the new package analyzer-cnhmm or something to help clarify it?
          I intend to also add a little wordage to the javadocs to help disambiguate this, whatever we decide to name it.

          thanks

          rcmuir Robert Muir added a comment - Simon, what do you think about name for the new folder? I am concerned users will be confused between analyzers/cjk, analyzers/cn and analyzer-cn, all of which are different. Should we name the new package analyzer-cnhmm or something to help clarify it? I intend to also add a little wordage to the javadocs to help disambiguate this, whatever we decide to name it. thanks

          I am concerned users will be confused between analyzers/cjk, analyzers/cn and analyzer-cn, all of which are different. Should we name the new package analyzer-cnhmm or something to help clarify it?

          The name analyzer-cn was just an example though but I don't like analyzer-cnhmm. Whats about analyzer-smartcn? Definitly +1 for a less ambigious name.

          I intend to also add a little wordage to the javadocs to help disambiguate this, whatever we decide to name it.

          +1

          simonw Simon Willnauer added a comment - I am concerned users will be confused between analyzers/cjk, analyzers/cn and analyzer-cn, all of which are different. Should we name the new package analyzer-cnhmm or something to help clarify it? The name analyzer-cn was just an example though but I don't like analyzer-cnhmm. Whats about analyzer-smartcn? Definitly +1 for a less ambigious name. I intend to also add a little wordage to the javadocs to help disambiguate this, whatever we decide to name it. +1
          rcmuir Robert Muir added a comment -

          Simon, analyzer-smartcn works, and its consistent with the name of the analyzer.

          If i "svn move" the files in my local, and submit a patch, will it ensure that history is preserved? I am not an svn expert.

          rcmuir Robert Muir added a comment - Simon, analyzer-smartcn works, and its consistent with the name of the analyzer. If i "svn move" the files in my local, and submit a patch, will it ensure that history is preserved? I am not an svn expert.
          simonw Simon Willnauer added a comment - - edited

          If i "svn move" the files in my local, and submit a patch, will it ensure that history is preserved? I am not an svn expert.

          No. an svn copy (or svn move) will not be reflected in a patch. If you "svn move A B" and create a patch from your local WC the history of A will be lost. I guess I should do the moveing and commit it, the refactoring should be done afterwards. Would that make sense to you?!

          simon

          simonw Simon Willnauer added a comment - - edited If i "svn move" the files in my local, and submit a patch, will it ensure that history is preserved? I am not an svn expert. No. an svn copy (or svn move) will not be reflected in a patch. If you "svn move A B" and create a patch from your local WC the history of A will be lost. I guess I should do the moveing and commit it, the refactoring should be done afterwards. Would that make sense to you?! simon
          rcmuir Robert Muir added a comment -

          simon, i think i've almost got it ready. i've got a set of svn move's etc you can run first, then a patch to apply over it.

          this way also the patch only reflects the real changes, and you can review what these are before changing anything in SVN...

          i'll upload it soon...

          rcmuir Robert Muir added a comment - simon, i think i've almost got it ready. i've got a set of svn move's etc you can run first, then a patch to apply over it. this way also the patch only reflects the real changes, and you can review what these are before changing anything in SVN... i'll upload it soon...
          uschindler Uwe Schindler added a comment -

          After creating the new contrib, do not forget to add the javadocs generation of the "all/" subdir in the main build.xml! Also new contribs should be added to the developers part in the site docs and so on. I can do that if you like after committing the whole thing (I have done it several times the last months for spatial, trie,...).

          Another idea: We can do it without creating a new contrib, instead do it like the contrib-bdb, which consists of 2 sub-contribs. Here the contrib folder of bdb is divided into two sub-folders, the build.xml of the main folder is just a "delegator" (or how you would call it) and delegates the ant targets to the build.xmls in the sub-folders. Using this approach we would still have only one contrib-analyzers main folder with two subdirs, which are two separate contribs modules (like the two bdb ones), but are in one folder.

          This approach is only good for source code, the user still gets the jar files in the main build folder directly under contrib. So I am not sure, if this is really better than two really separate contribs.

          uschindler Uwe Schindler added a comment - After creating the new contrib, do not forget to add the javadocs generation of the "all/" subdir in the main build.xml! Also new contribs should be added to the developers part in the site docs and so on. I can do that if you like after committing the whole thing (I have done it several times the last months for spatial, trie,...). Another idea: We can do it without creating a new contrib, instead do it like the contrib-bdb, which consists of 2 sub-contribs. Here the contrib folder of bdb is divided into two sub-folders, the build.xml of the main folder is just a "delegator" (or how you would call it) and delegates the ant targets to the build.xmls in the sub-folders. Using this approach we would still have only one contrib-analyzers main folder with two subdirs, which are two separate contribs modules (like the two bdb ones), but are in one folder. This approach is only good for source code, the user still gets the jar files in the main build folder directly under contrib. So I am not sure, if this is really better than two really separate contribs.

          After creating the new contrib, do not forget to add the javadocs generation of the "all/" subdir in the main build.xml! Also new contribs should be added to the developers part in the site docs and so on. I can do that if you like after committing the whole thing (I have done it several times the last months for spatial, trie,...).

          Uwe, I will not be able to commit those changes I guess. This reminds me that contrib commiters should have access to those files too. Once I get this change in I will notify you with a patch so you can get it in.

          This approach is only good for source code, the user still gets the jar files in the main build folder directly under contrib. So I am not sure, if this is really better than two really separate contribs.

          I really like this approach as it keeps the code logically consistent. I think we should go for this approach, that makes much more sense to me.

          simonw Simon Willnauer added a comment - After creating the new contrib, do not forget to add the javadocs generation of the "all/" subdir in the main build.xml! Also new contribs should be added to the developers part in the site docs and so on. I can do that if you like after committing the whole thing (I have done it several times the last months for spatial, trie,...). Uwe, I will not be able to commit those changes I guess. This reminds me that contrib commiters should have access to those files too. Once I get this change in I will notify you with a patch so you can get it in. This approach is only good for source code, the user still gets the jar files in the main build folder directly under contrib. So I am not sure, if this is really better than two really separate contribs. I really like this approach as it keeps the code logically consistent. I think we should go for this approach, that makes much more sense to me.
          rcmuir Robert Muir added a comment -

          I really like this approach as it keeps the code logically consistent. I think we should go for this approach, that makes much more sense to me.

          Simon, are you referring to Uwe's approach of splitting the analyzers contrib into two, or your (previous) approach of analyzer-smartcn contrib?

          rcmuir Robert Muir added a comment - I really like this approach as it keeps the code logically consistent. I think we should go for this approach, that makes much more sense to me. Simon, are you referring to Uwe's approach of splitting the analyzers contrib into two, or your (previous) approach of analyzer-smartcn contrib?

          Simon, are you referring to Uwe's approach of splitting the analyzers contrib into two, or your (previous) approach of analyzer-smartcn contrib?

          I refer to Uwe's approach.

          simonw Simon Willnauer added a comment - Simon, are you referring to Uwe's approach of splitting the analyzers contrib into two, or your (previous) approach of analyzer-smartcn contrib? I refer to Uwe's approach.
          rcmuir Robert Muir added a comment -

          great, I like this too. any preference on names? contrib/analyzers/analyzers and contrib/analyzers/smartcn?

          once we figure that out i can create updated set of svn moves + patch for this approach

          rcmuir Robert Muir added a comment - great, I like this too. any preference on names? contrib/analyzers/analyzers and contrib/analyzers/smartcn? once we figure that out i can create updated set of svn moves + patch for this approach

          contrib/analyzers/analyzers and contrib/analyzers/smartcn?

          +1

          go ahead!

          simonw Simon Willnauer added a comment - contrib/analyzers/analyzers and contrib/analyzers/smartcn? +1 go ahead!
          rcmuir Robert Muir added a comment -

          Simon, below is the method I used to do the refactoring with this patch.
          I know I am pressing the limits of what is a "refactoring" but in my opinion, this minor cleanup was necessary to prevent internal structures from being exposed:

          • Use of two Tokenizers in the same analyzer was confusing, WordTokenizer is now a TokenFilter.
          • Analyzer uses the standard WordListLoader rather than custom stuff.
          • Rather than force SmartChineseAnalyzer to keep track of internal heavyweight structures, it implements reusableTokenStream, etc.

          I added a few tests to ensure I didn't break anything in the SmartChineseAnalyzer.

          ## 1. clean svn checkout
          ## 2. run the following commands to refactor the files.
          
          mkdir -p contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analysis/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn
          svn add contrib/analysis
          svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn
          svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart
          svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analysis/smartcn/src/test/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn
          svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn
          svn move contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java
          svn move contrib/analyzers contrib/analysis
          
          ## 3. eclipse "refresh" at project level.
          ## 4. set text-file encoding at project level to UTF-8
          ## 5. manually force text-file encoding as UTF-8 for contrib/analysis/analyzers/src/java/org/apache/lucene/analysis/cn/package.html
          ##   this is an existing encoding issue that is corrected by this patch.
          ## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)
          
          
          rcmuir Robert Muir added a comment - Simon, below is the method I used to do the refactoring with this patch. I know I am pressing the limits of what is a "refactoring" but in my opinion, this minor cleanup was necessary to prevent internal structures from being exposed: Use of two Tokenizers in the same analyzer was confusing, WordTokenizer is now a TokenFilter. Analyzer uses the standard WordListLoader rather than custom stuff. Rather than force SmartChineseAnalyzer to keep track of internal heavyweight structures, it implements reusableTokenStream, etc. I added a few tests to ensure I didn't break anything in the SmartChineseAnalyzer. ## 1. clean svn checkout ## 2. run the following commands to refactor the files. mkdir -p contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analysis/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn svn add contrib/analysis svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analysis/smartcn/src/test/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn svn move contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java svn move contrib/analyzers contrib/analysis ## 3. eclipse "refresh" at project level. ## 4. set text-file encoding at project level to UTF-8 ## 5. manually force text-file encoding as UTF-8 for contrib/analysis/analyzers/src/java/org/apache/lucene/analysis/cn/package.html ## this is an existing encoding issue that is corrected by this patch. ## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)

          Robert, thanks for all that! I just had a brief look at it but looks good so far. I need to look over it again in the next days.
          Plan to commit it this week.

          simon

          simonw Simon Willnauer added a comment - Robert, thanks for all that! I just had a brief look at it but looks good so far. I need to look over it again in the next days. Plan to commit it this week. simon

          Robert, I don't think we should rename the directory analyzers to analysis I would rather go for analyzers/common and analyzers/smartcn or a similar scheme.

          simon

          simonw Simon Willnauer added a comment - Robert, I don't think we should rename the directory analyzers to analysis I would rather go for analyzers/common and analyzers/smartcn or a similar scheme. simon
          rcmuir Robert Muir added a comment -

          Simon, sounds good. I will update the patch / svn commands with that scheme (hopefully in the next day or 2)

          rcmuir Robert Muir added a comment - Simon, sounds good. I will update the patch / svn commands with that scheme (hopefully in the next day or 2)

          cool thanks!

          simonw Simon Willnauer added a comment - cool thanks!
          rcmuir Robert Muir added a comment -

          Simon, I revised the patch. Here are the new instructions for the analyzers/common and analyzers/smartcn scheme.
          Sorry for the delay.

          ## 1. clean svn checkout
          ## 2. run the following commands to refactor the files.
          
          mkdir contrib/analyzers/common
          mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
          svn add contrib/analyzers/smartcn contrib/analyzers/common
          svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
          svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart
          svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
          svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java
          svn move contrib/analyzers/build.xml contrib/analyzers/common
          svn move contrib/analyzers/pom.xml.template contrib/analyzers/common
          svn move contrib/analyzers/src contrib/analyzers/common
          
          ## 3. eclipse "refresh" at project level.
          ## 4. set text-file encoding at project level to UTF-8
          ## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html
          ##   this is an existing encoding issue that is corrected by this patch.
          ## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)
          
          rcmuir Robert Muir added a comment - Simon, I revised the patch. Here are the new instructions for the analyzers/common and analyzers/smartcn scheme. Sorry for the delay. ## 1. clean svn checkout ## 2. run the following commands to refactor the files. mkdir contrib/analyzers/common mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn add contrib/analyzers/smartcn contrib/analyzers/common svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java svn move contrib/analyzers/build.xml contrib/analyzers/common svn move contrib/analyzers/pom.xml.template contrib/analyzers/common svn move contrib/analyzers/src contrib/analyzers/common ## 3. eclipse "refresh" at project level. ## 4. set text-file encoding at project level to UTF-8 ## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ package .html ## this is an existing encoding issue that is corrected by this patch. ## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)
          rcmuir Robert Muir added a comment -

          same patch, but this time i clicked ASF license... sorry!

          rcmuir Robert Muir added a comment - same patch, but this time i clicked ASF license... sorry!

          Robert, I have looked at this patch and more important at the source itself and I get more and more the impression that we have to do more work on this analyzer and the related classes as just moving them into one package and make everything package private. From my understanding the Hidden Markov Model Segmenter is a feature which could be replaced by some other algorithm. Once you have such a feature relationship I would prefer packages by feature which enables you to remove a single feature just by removing a whole package.
          In other words I would love to see a general refactoring of the code which exploits a tiny but common API in the base package and is subsequently used by the HHMM "feature". There is quite a bit of work to do that I do not consider 2.9 work.
          So here is the question, do we keep the structure as it is and just move it to a new subdir to build a sep. jar or do we move them into one single package (as you did in the patch) and build up a clean HHMM package later in 3.*.

          Beside the packaging I found heaps of things I do not like very much in the code (not your patch an my fingertips getting nervous when I see stuff like the AbstractDictionary hierarchy or those Singletions. I would really like to have this separation of CN and common Analyzers in for 2.9 – we just need to decide which way we go. I guess moving it over without changing code would be easiest.

          simon

          simonw Simon Willnauer added a comment - Robert, I have looked at this patch and more important at the source itself and I get more and more the impression that we have to do more work on this analyzer and the related classes as just moving them into one package and make everything package private. From my understanding the Hidden Markov Model Segmenter is a feature which could be replaced by some other algorithm. Once you have such a feature relationship I would prefer packages by feature which enables you to remove a single feature just by removing a whole package. In other words I would love to see a general refactoring of the code which exploits a tiny but common API in the base package and is subsequently used by the HHMM "feature". There is quite a bit of work to do that I do not consider 2.9 work. So here is the question, do we keep the structure as it is and just move it to a new subdir to build a sep. jar or do we move them into one single package (as you did in the patch) and build up a clean HHMM package later in 3.*. Beside the packaging I found heaps of things I do not like very much in the code (not your patch an my fingertips getting nervous when I see stuff like the AbstractDictionary hierarchy or those Singletions. I would really like to have this separation of CN and common Analyzers in for 2.9 – we just need to decide which way we go. I guess moving it over without changing code would be easiest. simon
          rcmuir Robert Muir added a comment -

          Simon, I agree with you, there is a ton of work to be done.

          I also did not particularly like my method of moving everything into one package to hide the internals... and I 100% agree that a "correct" refactoring is quite a bit of work.

          I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also.

          • removal of GB2312 dictionary dependency: this limits functionality to simplified chinese.
          • use of unicode categories (java Character class, etc) versus Utility.getCharType()
          • support for codepoints outside of BMP, this is necessary to support traditional chinese.
          • a little more flexibility with tokenization, honestly I'm really not sold on indexing "words" for chinese in the first place. But words + bigrams (overlapping tokens), that would be nice.

          In the future it would be nice to add support for traditional chinese, and there is frequency data out there (libtabe: BSD license, etc), but we need to refactor first.

          As far as what to do for 2.9... I really don't know either, just let me know if you need a new patch

          rcmuir Robert Muir added a comment - Simon, I agree with you, there is a ton of work to be done. I also did not particularly like my method of moving everything into one package to hide the internals... and I 100% agree that a "correct" refactoring is quite a bit of work. I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also. removal of GB2312 dictionary dependency: this limits functionality to simplified chinese. use of unicode categories (java Character class, etc) versus Utility.getCharType() support for codepoints outside of BMP, this is necessary to support traditional chinese. a little more flexibility with tokenization, honestly I'm really not sold on indexing "words" for chinese in the first place. But words + bigrams (overlapping tokens), that would be nice. In the future it would be nice to add support for traditional chinese, and there is frequency data out there (libtabe: BSD license, etc), but we need to refactor first. As far as what to do for 2.9... I really don't know either, just let me know if you need a new patch

          I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also.

          pushing things forward is not complaining to me. I agree with you points I did not look closely into implementation details but rather on structural things. Apparently we both agree that we have work to do on this and I guess we can work out good solutions in the future together. Let's just move the classes into it's own subdir as you already did and keep the structure as it is (with the smallest changes - some classes have to be moved). If you could provide a patch I will commit the refactoring and we open a new issue for 3.*.
          This solution seems to be ideal as 2.9 release is quite close...

          simon

          simonw Simon Willnauer added a comment - I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also. pushing things forward is not complaining to me. I agree with you points I did not look closely into implementation details but rather on structural things. Apparently we both agree that we have work to do on this and I guess we can work out good solutions in the future together. Let's just move the classes into it's own subdir as you already did and keep the structure as it is (with the smallest changes - some classes have to be moved). If you could provide a patch I will commit the refactoring and we open a new issue for 3.*. This solution seems to be ideal as 2.9 release is quite close... simon
          rcmuir Robert Muir added a comment -

          Simon OK, I will work on a patch that tries to maintain the package structure.

          Other than package structure, is there anything in the patch you are uncomfortable with?
          I can either try to unfix any small fixes you don't like or create more testcases, whatever makes sense.

          rcmuir Robert Muir added a comment - Simon OK, I will work on a patch that tries to maintain the package structure. Other than package structure, is there anything in the patch you are uncomfortable with? I can either try to unfix any small fixes you don't like or create more testcases, whatever makes sense.

          Other than package structure, is there anything in the patch you are uncomfortable with?

          no that I could tell. You can keep whatever applies to the package structure - means we might have to keep some classes public etc.

          thanks for your patience! Good job!

          simon

          simonw Simon Willnauer added a comment - Other than package structure, is there anything in the patch you are uncomfortable with? no that I could tell. You can keep whatever applies to the package structure - means we might have to keep some classes public etc. thanks for your patience! Good job! simon
          rcmuir Robert Muir added a comment -

          Simon, yes some things may have to be public that should not be due to the package structure.

          I'll see if I can improve the javadocs for anything that falls in this situation as a short-term workaround.

          rcmuir Robert Muir added a comment - Simon, yes some things may have to be public that should not be due to the package structure. I'll see if I can improve the javadocs for anything that falls in this situation as a short-term workaround.
          michaelbusch Michael Busch added a comment -

          So we are going to move everything currently under contrib/analyzers to contrib/analyzers/common?

          michaelbusch Michael Busch added a comment - So we are going to move everything currently under contrib/analyzers to contrib/analyzers/common?
          rcmuir Robert Muir added a comment -

          yeah, except smart chinese analyzer. I am testing the latest patch (that keeps the previous smart chinese analyzer package structure), regenerating docs, etc etc.

          I will upload it in a few when I think it is good to go.

          rcmuir Robert Muir added a comment - yeah, except smart chinese analyzer. I am testing the latest patch (that keeps the previous smart chinese analyzer package structure), regenerating docs, etc etc. I will upload it in a few when I think it is good to go.
          rcmuir Robert Muir added a comment -

          Simon, here is the new patch. It also has the changes to build.xml and site.xml so that javadocs are correctly linked, and the regenerated docs.

          ## 1. clean svn checkout
          ## 2. run the following commands to refactor the files.
          
          mkdir contrib/analyzers/common
          mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis
          svn add contrib/analyzers/smartcn contrib/analyzers/common
          svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn
          svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis
          svn copy contrib/analyzers/build.xml contrib/analyzers/common
          svn move contrib/analyzers/pom.xml.template contrib/analyzers/common
          svn move contrib/analyzers/src contrib/analyzers/common
          svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/smart/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/smart/WordTokenFilter.java
          
          ## 3. eclipse "refresh" at project level.
          ## 4. set text-file encoding at project level to UTF-8
          ## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html
          ##   also manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cjk/package.html
          ##   this is an existing encoding issue that is corrected by this patch.
          ## 6. apply patch from clipboard (you may now remove the above hack and you will notice the above files are now detected properly as UTF-8)
          
          rcmuir Robert Muir added a comment - Simon, here is the new patch. It also has the changes to build.xml and site.xml so that javadocs are correctly linked, and the regenerated docs. ## 1. clean svn checkout ## 2. run the following commands to refactor the files. mkdir contrib/analyzers/common mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis svn add contrib/analyzers/smartcn contrib/analyzers/common svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis svn copy contrib/analyzers/build.xml contrib/analyzers/common svn move contrib/analyzers/pom.xml.template contrib/analyzers/common svn move contrib/analyzers/src contrib/analyzers/common svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/smart/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/smart/WordTokenFilter.java ## 3. eclipse "refresh" at project level. ## 4. set text-file encoding at project level to UTF-8 ## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html ## also manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cjk/package.html ## this is an existing encoding issue that is corrected by this patch. ## 6. apply patch from clipboard (you may now remove the above hack and you will notice the above files are now detected properly as UTF-8)

          Robert, I just committed your patch. Thanks a lot for that.
          I added equals and hashcode methods to the classes you removed them just in case.

          @ Uwe(or some other core commiter): could you please prepare the documentation and top level build.xml and commit it, thanks! I think robert already prepared everything in his patch.

          simonw Simon Willnauer added a comment - Robert, I just committed your patch. Thanks a lot for that. I added equals and hashcode methods to the classes you removed them just in case. @ Uwe(or some other core commiter): could you please prepare the documentation and top level build.xml and commit it, thanks! I think robert already prepared everything in his patch.
          rcmuir Robert Muir added a comment -

          Simon, thanks!

          oh, the equals and hashcode were commented out in the original src (I removed the commented lines).

          I was afraid to uncomment them (I didnt know why they were commented out),
          but I shouldn't have deleted the commented lines... thanks for resolving this.

          rcmuir Robert Muir added a comment - Simon, thanks! oh, the equals and hashcode were commented out in the original src (I removed the commented lines). I was afraid to uncomment them (I didnt know why they were commented out), but I shouldn't have deleted the commented lines... thanks for resolving this.

          I'll commit the top-level changes for the web-site. Thanks Robert!

          mikemccand Michael McCandless added a comment - I'll commit the top-level changes for the web-site. Thanks Robert!

          I'll commit the top-level changes for the web-site. Thanks Robert!

          thanks mike!

          simonw Simon Willnauer added a comment - I'll commit the top-level changes for the web-site. Thanks Robert! thanks mike!
          uschindler Uwe Schindler added a comment - - edited

          I committed a fix for the incorrect javadocs dirs for contrib/analysis in the main build.xml.
          Revision: 797213

          uschindler Uwe Schindler added a comment - - edited I committed a fix for the incorrect javadocs dirs for contrib/analysis in the main build.xml. Revision: 797213
          tomoko Tomoko Uchida added a comment -

          This issue was moved to GitHub issue: #2802.

          tomoko Tomoko Uchida added a comment - This issue was moved to GitHub issue: #2802 .

          People

            simonw Simon Willnauer
            simonw Simon Willnauer
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: