[LUCENE-1728] Move SmartChineseAnalyzer & resources to own contrib project - ASF JIRA

Robert Muir added a comment - 01/Jul/09 18:00

Simon, what do you think about name for the new folder?

I am concerned users will be confused between analyzers/cjk, analyzers/cn and analyzer-cn, all of which are different.
Should we name the new package analyzer-cnhmm or something to help clarify it?
I intend to also add a little wordage to the javadocs to help disambiguate this, whatever we decide to name it.

thanks

Robert Muir added a comment - 01/Jul/09 18:00 Simon, what do you think about name for the new folder? I am concerned users will be confused between analyzers/cjk, analyzers/cn and analyzer-cn, all of which are different. Should we name the new package analyzer-cnhmm or something to help clarify it? I intend to also add a little wordage to the javadocs to help disambiguate this, whatever we decide to name it. thanks

Simon Willnauer added a comment - 01/Jul/09 18:58

I am concerned users will be confused between analyzers/cjk, analyzers/cn and analyzer-cn, all of which are different. Should we name the new package analyzer-cnhmm or something to help clarify it?

The name analyzer-cn was just an example though but I don't like analyzer-cnhmm. Whats about analyzer-smartcn? Definitly +1 for a less ambigious name.

I intend to also add a little wordage to the javadocs to help disambiguate this, whatever we decide to name it.

+1

Simon Willnauer added a comment - 01/Jul/09 18:58 I am concerned users will be confused between analyzers/cjk, analyzers/cn and analyzer-cn, all of which are different. Should we name the new package analyzer-cnhmm or something to help clarify it? The name analyzer-cn was just an example though but I don't like analyzer-cnhmm. Whats about analyzer-smartcn? Definitly +1 for a less ambigious name. I intend to also add a little wordage to the javadocs to help disambiguate this, whatever we decide to name it. +1

Robert Muir added a comment - 01/Jul/09 19:08

Simon, analyzer-smartcn works, and its consistent with the name of the analyzer.

If i "svn move" the files in my local, and submit a patch, will it ensure that history is preserved? I am not an svn expert.

Robert Muir added a comment - 01/Jul/09 19:08 Simon, analyzer-smartcn works, and its consistent with the name of the analyzer. If i "svn move" the files in my local, and submit a patch, will it ensure that history is preserved? I am not an svn expert.

Simon Willnauer added a comment - 02/Jul/09 04:13 - edited

If i "svn move" the files in my local, and submit a patch, will it ensure that history is preserved? I am not an svn expert.

No. an svn copy (or svn move) will not be reflected in a patch. If you "svn move A B" and create a patch from your local WC the history of A will be lost. I guess I should do the moveing and commit it, the refactoring should be done afterwards. Would that make sense to you?!

simon

Simon Willnauer added a comment - 02/Jul/09 04:13 - edited If i "svn move" the files in my local, and submit a patch, will it ensure that history is preserved? I am not an svn expert. No. an svn copy (or svn move) will not be reflected in a patch. If you "svn move A B" and create a patch from your local WC the history of A will be lost. I guess I should do the moveing and commit it, the refactoring should be done afterwards. Would that make sense to you?! simon

Robert Muir added a comment - 02/Jul/09 05:19

simon, i think i've almost got it ready. i've got a set of svn move's etc you can run first, then a patch to apply over it.

this way also the patch only reflects the real changes, and you can review what these are before changing anything in SVN...

i'll upload it soon...

Robert Muir added a comment - 02/Jul/09 05:19 simon, i think i've almost got it ready. i've got a set of svn move's etc you can run first, then a patch to apply over it. this way also the patch only reflects the real changes, and you can review what these are before changing anything in SVN... i'll upload it soon...

Uwe Schindler added a comment - 02/Jul/09 06:44

After creating the new contrib, do not forget to add the javadocs generation of the "all/" subdir in the main build.xml! Also new contribs should be added to the developers part in the site docs and so on. I can do that if you like after committing the whole thing (I have done it several times the last months for spatial, trie,...).

Another idea: We can do it without creating a new contrib, instead do it like the contrib-bdb, which consists of 2 sub-contribs. Here the contrib folder of bdb is divided into two sub-folders, the build.xml of the main folder is just a "delegator" (or how you would call it) and delegates the ant targets to the build.xmls in the sub-folders. Using this approach we would still have only one contrib-analyzers main folder with two subdirs, which are two separate contribs modules (like the two bdb ones), but are in one folder.

This approach is only good for source code, the user still gets the jar files in the main build folder directly under contrib. So I am not sure, if this is really better than two really separate contribs.

Uwe Schindler added a comment - 02/Jul/09 06:44 After creating the new contrib, do not forget to add the javadocs generation of the "all/" subdir in the main build.xml! Also new contribs should be added to the developers part in the site docs and so on. I can do that if you like after committing the whole thing (I have done it several times the last months for spatial, trie,...). Another idea: We can do it without creating a new contrib, instead do it like the contrib-bdb, which consists of 2 sub-contribs. Here the contrib folder of bdb is divided into two sub-folders, the build.xml of the main folder is just a "delegator" (or how you would call it) and delegates the ant targets to the build.xmls in the sub-folders. Using this approach we would still have only one contrib-analyzers main folder with two subdirs, which are two separate contribs modules (like the two bdb ones), but are in one folder. This approach is only good for source code, the user still gets the jar files in the main build folder directly under contrib. So I am not sure, if this is really better than two really separate contribs.

Simon Willnauer added a comment - 02/Jul/09 07:14

After creating the new contrib, do not forget to add the javadocs generation of the "all/" subdir in the main build.xml! Also new contribs should be added to the developers part in the site docs and so on. I can do that if you like after committing the whole thing (I have done it several times the last months for spatial, trie,...).

Uwe, I will not be able to commit those changes I guess. This reminds me that contrib commiters should have access to those files too. Once I get this change in I will notify you with a patch so you can get it in.

This approach is only good for source code, the user still gets the jar files in the main build folder directly under contrib. So I am not sure, if this is really better than two really separate contribs.

I really like this approach as it keeps the code logically consistent. I think we should go for this approach, that makes much more sense to me.

Simon Willnauer added a comment - 02/Jul/09 07:14 After creating the new contrib, do not forget to add the javadocs generation of the "all/" subdir in the main build.xml! Also new contribs should be added to the developers part in the site docs and so on. I can do that if you like after committing the whole thing (I have done it several times the last months for spatial, trie,...). Uwe, I will not be able to commit those changes I guess. This reminds me that contrib commiters should have access to those files too. Once I get this change in I will notify you with a patch so you can get it in. This approach is only good for source code, the user still gets the jar files in the main build folder directly under contrib. So I am not sure, if this is really better than two really separate contribs. I really like this approach as it keeps the code logically consistent. I think we should go for this approach, that makes much more sense to me.

Robert Muir added a comment - 02/Jul/09 12:10

I really like this approach as it keeps the code logically consistent. I think we should go for this approach, that makes much more sense to me.

Simon, are you referring to Uwe's approach of splitting the analyzers contrib into two, or your (previous) approach of analyzer-smartcn contrib?

Robert Muir added a comment - 02/Jul/09 12:10 I really like this approach as it keeps the code logically consistent. I think we should go for this approach, that makes much more sense to me. Simon, are you referring to Uwe's approach of splitting the analyzers contrib into two, or your (previous) approach of analyzer-smartcn contrib?

Simon Willnauer added a comment - 02/Jul/09 12:21

Simon, are you referring to Uwe's approach of splitting the analyzers contrib into two, or your (previous) approach of analyzer-smartcn contrib?

I refer to Uwe's approach.

Simon Willnauer added a comment - 02/Jul/09 12:21 Simon, are you referring to Uwe's approach of splitting the analyzers contrib into two, or your (previous) approach of analyzer-smartcn contrib? I refer to Uwe's approach.

Robert Muir added a comment - 02/Jul/09 12:36

great, I like this too. any preference on names? contrib/analyzers/analyzers and contrib/analyzers/smartcn?

once we figure that out i can create updated set of svn moves + patch for this approach

Robert Muir added a comment - 02/Jul/09 12:36 great, I like this too. any preference on names? contrib/analyzers/analyzers and contrib/analyzers/smartcn? once we figure that out i can create updated set of svn moves + patch for this approach

Simon Willnauer added a comment - 03/Jul/09 07:55

contrib/analyzers/analyzers and contrib/analyzers/smartcn?

+1

go ahead!

Simon Willnauer added a comment - 03/Jul/09 07:55 contrib/analyzers/analyzers and contrib/analyzers/smartcn? +1 go ahead!

Robert Muir added a comment - 06/Jul/09 02:01

Simon, below is the method I used to do the refactoring with this patch.
I know I am pressing the limits of what is a "refactoring" but in my opinion, this minor cleanup was necessary to prevent internal structures from being exposed:

Use of two Tokenizers in the same analyzer was confusing, WordTokenizer is now a TokenFilter.
Analyzer uses the standard WordListLoader rather than custom stuff.
Rather than force SmartChineseAnalyzer to keep track of internal heavyweight structures, it implements reusableTokenStream, etc.

I added a few tests to ensure I didn't break anything in the SmartChineseAnalyzer.

## 1. clean svn checkout
## 2. run the following commands to refactor the files.

mkdir -p contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analysis/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn
svn add contrib/analysis
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn
svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart
svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analysis/smartcn/src/test/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn
svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn
svn move contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java
svn move contrib/analyzers contrib/analysis

## 3. eclipse "refresh" at project level.
## 4. set text-file encoding at project level to UTF-8
## 5. manually force text-file encoding as UTF-8 for contrib/analysis/analyzers/src/java/org/apache/lucene/analysis/cn/package.html
##   this is an existing encoding issue that is corrected by this patch.
## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)

Robert Muir added a comment - 06/Jul/09 02:01 Simon, below is the method I used to do the refactoring with this patch. I know I am pressing the limits of what is a "refactoring" but in my opinion, this minor cleanup was necessary to prevent internal structures from being exposed: Use of two Tokenizers in the same analyzer was confusing, WordTokenizer is now a TokenFilter. Analyzer uses the standard WordListLoader rather than custom stuff. Rather than force SmartChineseAnalyzer to keep track of internal heavyweight structures, it implements reusableTokenStream, etc. I added a few tests to ensure I didn't break anything in the SmartChineseAnalyzer. ## 1. clean svn checkout ## 2. run the following commands to refactor the files. mkdir -p contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analysis/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn svn add contrib/analysis svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analysis/smartcn/src/test/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analysis/smartcn/src/resources/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn svn move contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analysis/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java svn move contrib/analyzers contrib/analysis ## 3. eclipse "refresh" at project level. ## 4. set text-file encoding at project level to UTF-8 ## 5. manually force text-file encoding as UTF-8 for contrib/analysis/analyzers/src/java/org/apache/lucene/analysis/cn/package.html ## this is an existing encoding issue that is corrected by this patch. ## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)

Simon Willnauer added a comment - 07/Jul/09 18:07

Robert, thanks for all that! I just had a brief look at it but looks good so far. I need to look over it again in the next days.
Plan to commit it this week.

simon

Simon Willnauer added a comment - 07/Jul/09 18:07 Robert, thanks for all that! I just had a brief look at it but looks good so far. I need to look over it again in the next days. Plan to commit it this week. simon

Simon Willnauer added a comment - 15/Jul/09 08:55

Robert, I don't think we should rename the directory analyzers to analysis I would rather go for analyzers/common and analyzers/smartcn or a similar scheme.

simon

Simon Willnauer added a comment - 15/Jul/09 08:55 Robert, I don't think we should rename the directory analyzers to analysis I would rather go for analyzers/common and analyzers/smartcn or a similar scheme. simon

Robert Muir added a comment - 15/Jul/09 09:04

Simon, sounds good. I will update the patch / svn commands with that scheme (hopefully in the next day or 2)

Robert Muir added a comment - 15/Jul/09 09:04 Simon, sounds good. I will update the patch / svn commands with that scheme (hopefully in the next day or 2)

Simon Willnauer added a comment - 15/Jul/09 09:21

cool thanks!

Simon Willnauer added a comment - 15/Jul/09 09:21 cool thanks!

Robert Muir added a comment - 16/Jul/09 18:48

Simon, I revised the patch. Here are the new instructions for the analyzers/common and analyzers/smartcn scheme.
Sorry for the delay.

## 1. clean svn checkout
## 2. run the following commands to refactor the files.

mkdir contrib/analyzers/common
mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
svn add contrib/analyzers/smartcn contrib/analyzers/common
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart
svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn
svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn
svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java
svn move contrib/analyzers/build.xml contrib/analyzers/common
svn move contrib/analyzers/pom.xml.template contrib/analyzers/common
svn move contrib/analyzers/src contrib/analyzers/common

## 3. eclipse "refresh" at project level.
## 4. set text-file encoding at project level to UTF-8
## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html
##   this is an existing encoding issue that is corrected by this patch.
## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)

Robert Muir added a comment - 16/Jul/09 18:48 Simon, I revised the patch. Here are the new instructions for the analyzers/common and analyzers/smartcn scheme. Sorry for the delay. ## 1. clean svn checkout ## 2. run the following commands to refactor the files. mkdir contrib/analyzers/common mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn add contrib/analyzers/smartcn contrib/analyzers/common svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart/*.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/stopwords.txt contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn/smart/hhmm/* contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis/cn svn delete contrib/analyzers/src/resources/org/apache/lucene/analysis/cn svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/WordTokenFilter.java svn move contrib/analyzers/build.xml contrib/analyzers/common svn move contrib/analyzers/pom.xml.template contrib/analyzers/common svn move contrib/analyzers/src contrib/analyzers/common ## 3. eclipse "refresh" at project level. ## 4. set text-file encoding at project level to UTF-8 ## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/ package .html ## this is an existing encoding issue that is corrected by this patch. ## 6. apply patch from clipboard (you may now remove the above hack and you will notice this file is now detected properly as UTF-8)

Robert Muir added a comment - 16/Jul/09 18:50

same patch, but this time i clicked ASF license... sorry!

Robert Muir added a comment - 16/Jul/09 18:50 same patch, but this time i clicked ASF license... sorry!

Simon Willnauer added a comment - 21/Jul/09 08:52

Robert, I have looked at this patch and more important at the source itself and I get more and more the impression that we have to do more work on this analyzer and the related classes as just moving them into one package and make everything package private. From my understanding the Hidden Markov Model Segmenter is a feature which could be replaced by some other algorithm. Once you have such a feature relationship I would prefer packages by feature which enables you to remove a single feature just by removing a whole package.
In other words I would love to see a general refactoring of the code which exploits a tiny but common API in the base package and is subsequently used by the HHMM "feature". There is quite a bit of work to do that I do not consider 2.9 work.
So here is the question, do we keep the structure as it is and just move it to a new subdir to build a sep. jar or do we move them into one single package (as you did in the patch) and build up a clean HHMM package later in 3.*.

Beside the packaging I found heaps of things I do not like very much in the code (not your patch an my fingertips getting nervous when I see stuff like the AbstractDictionary hierarchy or those Singletions. I would really like to have this separation of CN and common Analyzers in for 2.9 – we just need to decide which way we go. I guess moving it over without changing code would be easiest.

simon

Simon Willnauer added a comment - 21/Jul/09 08:52 Robert, I have looked at this patch and more important at the source itself and I get more and more the impression that we have to do more work on this analyzer and the related classes as just moving them into one package and make everything package private. From my understanding the Hidden Markov Model Segmenter is a feature which could be replaced by some other algorithm. Once you have such a feature relationship I would prefer packages by feature which enables you to remove a single feature just by removing a whole package. In other words I would love to see a general refactoring of the code which exploits a tiny but common API in the base package and is subsequently used by the HHMM "feature". There is quite a bit of work to do that I do not consider 2.9 work. So here is the question, do we keep the structure as it is and just move it to a new subdir to build a sep. jar or do we move them into one single package (as you did in the patch) and build up a clean HHMM package later in 3.*. Beside the packaging I found heaps of things I do not like very much in the code (not your patch an my fingertips getting nervous when I see stuff like the AbstractDictionary hierarchy or those Singletions. I would really like to have this separation of CN and common Analyzers in for 2.9 – we just need to decide which way we go. I guess moving it over without changing code would be easiest. simon

Robert Muir added a comment - 21/Jul/09 09:19

Simon, I agree with you, there is a ton of work to be done.

I also did not particularly like my method of moving everything into one package to hide the internals... and I 100% agree that a "correct" refactoring is quite a bit of work.

I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also.

removal of GB2312 dictionary dependency: this limits functionality to simplified chinese.
use of unicode categories (java Character class, etc) versus Utility.getCharType()
support for codepoints outside of BMP, this is necessary to support traditional chinese.
a little more flexibility with tokenization, honestly I'm really not sold on indexing "words" for chinese in the first place. But words + bigrams (overlapping tokens), that would be nice.

In the future it would be nice to add support for traditional chinese, and there is frequency data out there (libtabe: BSD license, etc), but we need to refactor first.

As far as what to do for 2.9... I really don't know either, just let me know if you need a new patch

Robert Muir added a comment - 21/Jul/09 09:19 Simon, I agree with you, there is a ton of work to be done. I also did not particularly like my method of moving everything into one package to hide the internals... and I 100% agree that a "correct" refactoring is quite a bit of work. I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also. removal of GB2312 dictionary dependency: this limits functionality to simplified chinese. use of unicode categories (java Character class, etc) versus Utility.getCharType() support for codepoints outside of BMP, this is necessary to support traditional chinese. a little more flexibility with tokenization, honestly I'm really not sold on indexing "words" for chinese in the first place. But words + bigrams (overlapping tokens), that would be nice. In the future it would be nice to add support for traditional chinese, and there is frequency data out there (libtabe: BSD license, etc), but we need to refactor first. As far as what to do for 2.9... I really don't know either, just let me know if you need a new patch

Simon Willnauer added a comment - 21/Jul/09 09:34

I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also.

pushing things forward is not complaining to me. I agree with you points I did not look closely into implementation details but rather on structural things. Apparently we both agree that we have work to do on this and I guess we can work out good solutions in the future together. Let's just move the classes into it's own subdir as you already did and keep the structure as it is (with the smallest changes - some classes have to be moved). If you could provide a patch I will commit the refactoring and we open a new issue for 3.*.
This solution seems to be ideal as 2.9 release is quite close...

simon

Simon Willnauer added a comment - 21/Jul/09 09:34 I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also. pushing things forward is not complaining to me. I agree with you points I did not look closely into implementation details but rather on structural things. Apparently we both agree that we have work to do on this and I guess we can work out good solutions in the future together. Let's just move the classes into it's own subdir as you already did and keep the structure as it is (with the smallest changes - some classes have to be moved). If you could provide a patch I will commit the refactoring and we open a new issue for 3.*. This solution seems to be ideal as 2.9 release is quite close... simon

Robert Muir added a comment - 21/Jul/09 09:47

Simon OK, I will work on a patch that tries to maintain the package structure.

Other than package structure, is there anything in the patch you are uncomfortable with?
I can either try to unfix any small fixes you don't like or create more testcases, whatever makes sense.

Robert Muir added a comment - 21/Jul/09 09:47 Simon OK, I will work on a patch that tries to maintain the package structure. Other than package structure, is there anything in the patch you are uncomfortable with? I can either try to unfix any small fixes you don't like or create more testcases, whatever makes sense.

Simon Willnauer added a comment - 21/Jul/09 09:56

Other than package structure, is there anything in the patch you are uncomfortable with?

no that I could tell. You can keep whatever applies to the package structure - means we might have to keep some classes public etc.

thanks for your patience! Good job!

simon

Simon Willnauer added a comment - 21/Jul/09 09:56 Other than package structure, is there anything in the patch you are uncomfortable with? no that I could tell. You can keep whatever applies to the package structure - means we might have to keep some classes public etc. thanks for your patience! Good job! simon

Robert Muir added a comment - 21/Jul/09 10:01

Simon, yes some things may have to be public that should not be due to the package structure.

I'll see if I can improve the javadocs for anything that falls in this situation as a short-term workaround.

Robert Muir added a comment - 21/Jul/09 10:01 Simon, yes some things may have to be public that should not be due to the package structure. I'll see if I can improve the javadocs for anything that falls in this situation as a short-term workaround.

Michael Busch added a comment - 22/Jul/09 04:58

So we are going to move everything currently under contrib/analyzers to contrib/analyzers/common?

Michael Busch added a comment - 22/Jul/09 04:58 So we are going to move everything currently under contrib/analyzers to contrib/analyzers/common?

Robert Muir added a comment - 22/Jul/09 05:13

yeah, except smart chinese analyzer. I am testing the latest patch (that keeps the previous smart chinese analyzer package structure), regenerating docs, etc etc.

I will upload it in a few when I think it is good to go.

Robert Muir added a comment - 22/Jul/09 05:13 yeah, except smart chinese analyzer. I am testing the latest patch (that keeps the previous smart chinese analyzer package structure), regenerating docs, etc etc. I will upload it in a few when I think it is good to go.

Robert Muir added a comment - 22/Jul/09 05:37

Simon, here is the new patch. It also has the changes to build.xml and site.xml so that javadocs are correctly linked, and the regenerated docs.

## 1. clean svn checkout
## 2. run the following commands to refactor the files.

mkdir contrib/analyzers/common
mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis
svn add contrib/analyzers/smartcn contrib/analyzers/common
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn
svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis
svn copy contrib/analyzers/build.xml contrib/analyzers/common
svn move contrib/analyzers/pom.xml.template contrib/analyzers/common
svn move contrib/analyzers/src contrib/analyzers/common
svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/smart/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/smart/WordTokenFilter.java

## 3. eclipse "refresh" at project level.
## 4. set text-file encoding at project level to UTF-8
## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html
##   also manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cjk/package.html
##   this is an existing encoding issue that is corrected by this patch.
## 6. apply patch from clipboard (you may now remove the above hack and you will notice the above files are now detected properly as UTF-8)

Robert Muir added a comment - 22/Jul/09 05:37 Simon, here is the new patch. It also has the changes to build.xml and site.xml so that javadocs are correctly linked, and the regenerated docs. ## 1. clean svn checkout ## 2. run the following commands to refactor the files. mkdir contrib/analyzers/common mkdir -p contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis svn add contrib/analyzers/smartcn contrib/analyzers/common svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/SmartChineseAnalyzer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/java/org/apache/lucene/analysis/cn/smart contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/test/org/apache/lucene/analysis/cn/TestSmartChineseAnalyzer.java contrib/analyzers/smartcn/src/test/org/apache/lucene/analysis/cn svn move contrib/analyzers/src/resources/org/apache/lucene/analysis/cn contrib/analyzers/smartcn/src/resources/org/apache/lucene/analysis svn copy contrib/analyzers/build.xml contrib/analyzers/common svn move contrib/analyzers/pom.xml.template contrib/analyzers/common svn move contrib/analyzers/src contrib/analyzers/common svn move contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/smart/WordTokenizer.java contrib/analyzers/smartcn/src/java/org/apache/lucene/analysis/cn/smart/WordTokenFilter.java ## 3. eclipse "refresh" at project level. ## 4. set text-file encoding at project level to UTF-8 ## 5. manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cn/package.html ## also manually force text-file encoding as UTF-8 for contrib/analyzers/common/src/java/org/apache/lucene/analysis/cjk/package.html ## this is an existing encoding issue that is corrected by this patch. ## 6. apply patch from clipboard (you may now remove the above hack and you will notice the above files are now detected properly as UTF-8)

Simon Willnauer added a comment - 23/Jul/09 17:12

Robert, I just committed your patch. Thanks a lot for that.
I added equals and hashcode methods to the classes you removed them just in case.

@ Uwe(or some other core commiter): could you please prepare the documentation and top level build.xml and commit it, thanks! I think robert already prepared everything in his patch.

Simon Willnauer added a comment - 23/Jul/09 17:12 Robert, I just committed your patch. Thanks a lot for that. I added equals and hashcode methods to the classes you removed them just in case. @ Uwe(or some other core commiter): could you please prepare the documentation and top level build.xml and commit it, thanks! I think robert already prepared everything in his patch.

Robert Muir added a comment - 23/Jul/09 17:20

Simon, thanks!

oh, the equals and hashcode were commented out in the original src (I removed the commented lines).

I was afraid to uncomment them (I didnt know why they were commented out),
but I shouldn't have deleted the commented lines... thanks for resolving this.

Robert Muir added a comment - 23/Jul/09 17:20 Simon, thanks! oh, the equals and hashcode were commented out in the original src (I removed the commented lines). I was afraid to uncomment them (I didnt know why they were commented out), but I shouldn't have deleted the commented lines... thanks for resolving this.

Michael McCandless added a comment - 23/Jul/09 18:28

I'll commit the top-level changes for the web-site. Thanks Robert!

Michael McCandless added a comment - 23/Jul/09 18:28 I'll commit the top-level changes for the web-site. Thanks Robert!

Simon Willnauer added a comment - 23/Jul/09 18:43

I'll commit the top-level changes for the web-site. Thanks Robert!

thanks mike!

Simon Willnauer added a comment - 23/Jul/09 18:43 I'll commit the top-level changes for the web-site. Thanks Robert! thanks mike!

Uwe Schindler added a comment - 23/Jul/09 20:29 - edited

I committed a fix for the incorrect javadocs dirs for contrib/analysis in the main build.xml.
Revision: 797213

Uwe Schindler added a comment - 23/Jul/09 20:29 - edited I committed a fix for the incorrect javadocs dirs for contrib/analysis in the main build.xml. Revision: 797213

Tomoko Uchida added a comment - 28/Aug/22 12:03

This issue was moved to GitHub issue: #2802.

Tomoko Uchida added a comment - 28/Aug/22 12:03 This issue was moved to GitHub issue: #2802 .

Lucene - Core

Move SmartChineseAnalyzer & resources to own contrib project

Details

Description

Attachments

Attachments

Activity

People

Dates