Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      Patch Available

      Description

      A token filter to decompose compound words found in many Germanic languages (like German, Swedish, ...) into single tokens.

      An example: Donaudampfschiff would be decomposed to Donau, dampf, schiff so that you can find the word even when you only enter "Schiff".

      I use the hyphenation code from the Apache XML project FOP (http://xmlgraphics.apache.org/fop/) to do the first step of decomposition. Currently I use the FOP jars directly. I only use a handful of classes from the FOP project.
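      A rough sketch of how those two steps could fit together (an illustration only, not FOP or patch code; the hyphenationPoints array stands in for whatever the hyphenator returns, and the dictionary check is an assumption of this sketch): candidate subwords between hyphenation points are looked up in a lower-cased word list.

          import java.util.ArrayList;
          import java.util.List;
          import java.util.Set;

          // Keep every candidate subword (bounded by two hyphenation points)
          // that appears in the dictionary.
          public class HyphenationDecomposeSketch {

            static List<String> decompose(String word, int[] hyphenationPoints,
                                          Set<String> dictionary) {
              List<String> subwords = new ArrayList<String>();
              for (int i = 0; i < hyphenationPoints.length - 1; i++) {
                for (int j = i + 1; j < hyphenationPoints.length; j++) {
                  String candidate = word.substring(hyphenationPoints[i], hyphenationPoints[j]);
                  if (dictionary.contains(candidate.toLowerCase())) {
                    subwords.add(candidate);
                  }
                }
              }
              return subwords;
            }
          }

      For "Donaudampfschiff", assuming the hyphenator proposes break points {0, 5, 10, 16} and the word list contains "donau", "dampf" and "schiff", this would return Donau, dampf and schiff.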

      My question now:
      Would it be OK to copy these classes over to the Lucene project (renaming the packages, of course), or should I stick with the dependency on the FOP jars? The FOP code uses the ASF V2 license as well.

      What do you think?

      1. CompoundTokenFilter.patch
        106 kB
        Thomas Peuss
      2. CompoundTokenFilter.patch
        106 kB
        Thomas Peuss
      3. CompoundTokenFilter.patch
        105 kB
        Thomas Peuss
      4. CompoundTokenFilter.patch
        99 kB
        Thomas Peuss
      5. CompoundTokenFilter.patch
        90 kB
        Thomas Peuss
      6. CompoundTokenFilter.patch
        90 kB
        Thomas Peuss
      7. CompoundTokenFilter.patch
        91 kB
        Thomas Peuss
      8. CompoundTokenFilter.patch
        90 kB
        Thomas Peuss
      9. CompoundTokenFilter.patch
        85 kB
        Thomas Peuss
      10. CompoundTokenFilter.patch
        76 kB
        Thomas Peuss
      11. CompoundTokenFilter.patch
        71 kB
        Thomas Peuss
      12. de.xml
        48 kB
        Thomas Peuss
      13. hyphenation.dtd
        3 kB
        Thomas Peuss

          Activity

          Thomas Peuss added a comment -

          A preliminary version of the token filter.

          Thomas Peuss added a comment -

          A hyphenation grammar. You can download them from: http://downloads.sourceforge.net/offo/offo-hyphenation.zip?modtime=1168687306&big_mirror=0
          Thomas Peuss added a comment -

          The DTD describing the hyphenation grammar XML files.

          Steve Rowe added a comment -

          Hi Thomas,

          Looking at http://offo.sourceforge.net/hyphenation/licenses.html, which seems to be the same information as in the offo-hyphenation.zip file you attached to this issue, the license issue may be a problem - the hyphenation data is covered by different licenses on a per-language basis. For example, there are two German data files, and both are licensed under a LaTeX license, as is the Danish file, and these two languages are the most likely targets for your TokenFilter. IANAL, but unless Apache licenses can be secured for this data, I don't think the files can be incorporated directly into an Apache project.

          Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way?

          Thomas Peuss added a comment -

          Looking at http://offo.sourceforge.net/hyphenation/licenses.html, which seems to be the same information as in the off-hyphenation.zip file you attached to this issue, the license issue may be a problem - the hyphenation data is covered by different licenses on a per-language basis. For example, there are two German data files, and both are licensed under a LaTeX license, as is the Danish file, and these two languages are the most likely targets for your TokenFilter. IANAL, but unless Apache licenses can be secured for this data, I don't think the files can be incorporated directly into an Apache project.

          This is true. And that is why I uploaded the two files without the ASF license grant. The FOP project does not ship these files in its code base either, because of the licensing problem.

          Also, I don't see Swedish among the hyphenation data licenses - is it covered in some other way?

          OFFO has no Swedish grammar file. We can generate a Swedish grammar file from the LaTeX grammar files. I will have a look into this tonight.

          All other hyphenation implementations I have found so far use them either directly or in a converted variant, like the FOP code. What we can do, of course, is ask the authors of the LaTeX files whether they want to license their work under the ASF license as well. It is worth a try. But I suppose that many email addresses in the LaTeX files are no longer in use. I will try to contact the authors of the German grammar files tomorrow.

          BTW: an example for those that don't want to try the patch:
          Input token stream:
          Rindfleischüberwachungsgesetz Drahtschere abba

          Output token stream:
          (Rindfleischüberwachungsgesetz,0,29)
          (Rind,0,4,posIncr=0)
          (fleisch,4,11,posIncr=0)
          (überwachung,11,22,posIncr=0)
          (gesetz,23,29,posIncr=0)
          (Drahtschere,30,41)
          (Draht,30,35,posIncr=0)
          (schere,35,41,posIncr=0)
          (abba,42,46)

          Thomas Peuss added a comment -

          A Swedish hyphenation grammar is available at http://www.peuss.de/node/64

          Thomas Peuss added a comment -

          Changes:

          • added a unit test
          • minor tweaks for getting the encoding of the XML files right
          Thomas Peuss added a comment -

          Updated version:

          • new "dumb" decomposition filter
            • uses a brute-force approach, generating substrings and checking them against the dictionary (see the sketch after this list)
            • seems to work better for languages that have no patterns file with a lot of special cases
            • is roughly 3 times slower than the decomposition filter using hyphenation patterns
            • no licensing problems, because it does not need the hyphenation pattern files
          • refactoring to have all methods used by both decomposition filters in one place
          • minor performance improvements
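          A minimal sketch of that brute-force idea, assuming a plain lower-cased word set as the dictionary (illustration only, not the patch code):

              import java.util.ArrayList;
              import java.util.List;
              import java.util.Set;

              // Try every substring between minSubwordSize and maxSubwordSize
              // characters long and keep the ones found in the dictionary.
              public class BruteForceDecomposeSketch {

                static List<String> decompose(String token, Set<String> dictionary,
                                              int minSubwordSize, int maxSubwordSize) {
                  String lower = token.toLowerCase(); // lower-case the token only once
                  List<String> subwords = new ArrayList<String>();
                  for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
                    int maxEnd = Math.min(lower.length(), start + maxSubwordSize);
                    for (int end = start + minSubwordSize; end <= maxEnd; end++) {
                      if (dictionary.contains(lower.substring(start, end))) {
                        subwords.add(token.substring(start, end));
                      }
                    }
                  }
                  return subwords;
                }
              }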
          Otis Gospodnetic added a comment -

          I haven't looked at the patch.
          But I'm wondering if a similar approach could be used for, say, word segmentation in Chinese?
          That is, iterate through a string of Chinese characters, buffering them and looking up the buffered string in a Chinese dictionary. Once there is a dictionary match, and the addition of the following character results in a string that has no entry in the dictionary, that previous buffered string can be considered a word/token.

          I'm not sure if your patch does something like this, but if it does, I am wondering if it is general enough that what you did can be used (as the basis of) word segmentation for Chinese, and thus for a Chinese Analyzer that's not just a dumb n-gram Analyzer (which is what we have today).
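          A sketch of the dictionary-driven segmentation described above, slightly generalized so that each position keeps the longest dictionary entry starting there rather than stopping at the first miss (names and structure are made up for this illustration):

              import java.util.ArrayList;
              import java.util.List;
              import java.util.Set;

              public class LongestMatchSegmentSketch {

                static List<String> segment(String text, Set<String> dictionary) {
                  List<String> words = new ArrayList<String>();
                  int pos = 0;
                  while (pos < text.length()) {
                    int matchEnd = -1;
                    // Buffer one more character at a time and remember the longest
                    // buffered string that is a dictionary entry.
                    for (int end = pos + 1; end <= text.length(); end++) {
                      if (dictionary.contains(text.substring(pos, end))) {
                        matchEnd = end;
                      }
                    }
                    if (matchEnd == -1) {
                      words.add(text.substring(pos, pos + 1)); // unknown character: emit it alone
                      pos++;
                    } else {
                      words.add(text.substring(pos, matchEnd));
                      pos = matchEnd;
                    }
                  }
                  return words;
                }
              }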

          Thomas Peuss added a comment -

          But I'm wondering if a similar approach could be used for, say, word segmentation in Chinese? That is, iterate through a string of Chinese characters, buffering them and looking up the buffered string in a Chinese dictionary. Once there is a dictionary match, and the addition of the following character results in a string that has no entry in the dictionary, that previous buffered string can be considered a word/token.

          I'm not sure if your patch does something like this, but if it does, I am wondering if it is general enough that what you did can be used (as the basis of) word segmentation for Chinese, and thus for a Chinese Analyzer that's not just a dumb n-gram Analyzer (which is what we have today).

          Currently the code adds a token to the stream when an n-gram of the current token in the token stream matches a word in the dictionary (I am only speaking about the DumbCompoundWordTokenFilter, because I doubt that hyphenation patterns exist for Chinese languages). I don't know much about the structure of Chinese characters to answer this question in detail. You can have a look at the test case in the patch to see how the filters work.

          Otis Gospodnetic added a comment -

          Thomas, I think that might work for Chinese - going through the "string" of Chinese characters, one at a time, and looking up a dictionary after each additional character. Once you find a dictionary match, you look at one more character. If that matches a dictionary entry, keep doing that as long as you keep matching dictionary entries (in order to grab the longest dictionary-matching string of characters). If the next character does not match, then the previous/last character was the end of the dictionary entry.
          That would work, no?

          As for the license info, I think you could take the approach where the required libraries are not included in the contribution in the ASF repo, but are downloaded on the fly, at build time, much like some other contributions. Could you do that?

          Thomas Peuss added a comment -

          Thomas, I think that might work for Chinese - going through the "string" of Chinese characters, one at a time, and looking up a dictionary after each additional character. Once you find a dictionary match, you look at one more character. If that matches a dictionary entry, keep doing that as long as you keep matching dictionary entries (in order to grab the longest dictionary-matching string of characters). If the next character does not match, then the previous/last character was the end of the dictionary entry. That would work, no?

          I have started to look into this. I will add the constructor parameter "onlyLongestMatch" (default is false).

          As for the license info, I think you could take the approach where the required libraries are not included in the contribution in the ASF repo, but are downloaded on the fly, at build time, much like some other contributions. Could you do that?

          I pull the grammar files for the tests already. But I don't know if it makes sense to pull them at build time because the end user can easily download them. I need the XML versions now - so the jar file from Sourceforge does not help anymore (I have included the needed classes from the FOP project - they use the ASF license as well).

          Thomas Peuss added a comment -

          Updated patch according to Otis's suggestion for longest match.

          Next step: move to contrib

          Thomas Peuss added a comment -

          Moved the compound word tokenfilter stuff to contrib.

          Thomas Peuss added a comment -

          Moved compound word token filter to contrib.

          Thomas Peuss added a comment -

          Dropped Java5 dependencies.

          Thomas Peuss added a comment -

          Fixed a compilation bug in the testcase.

          Grant Ingersoll added a comment -

          I pull the grammar files for the tests already. But I don't know if it makes sense to pull them at build time because the end user can easily download them. I need the XML versions now - so the jar file from Sourceforge does not help anymore (I have included the needed classes from the FOP project - they use the ASF license as well).

          I think they have to be downloaded automatically, otherwise the automated tests, etc. will not run. I applied the patch and ran "ant test" and it fails because I didn't download the files.

          Also, much of the code has author tags that are not you. I am assuming you got it from FOP per your comments above, but can you explicitly mark all the files as to their origin?

          Thomas Peuss added a comment -

          All files in the package org.apache.lucene.analysis.compound.hyphenation are from the FOP project (ASF licensed as well). Should I add a comment to them stating where they come from? All other files are from me. I will check why it fails when you run "ant test" by downloading a fresh copy of Lucene trunk.

          Thomas Peuss added a comment -

          The error is

              [junit] Testsuite: org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter
              [junit] Tests run: 4, Failures: 0, Errors: 2, Time elapsed: 2,139 sec
              [junit]
              [junit] Testcase: testHyphenationCompoundWordsDE(org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter):  Caused an ERROR
              [junit] File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
              [junit] org.apache.lucene.analysis.compound.hyphenation.HyphenationException: File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
              [junit]     at org.apache.lucene.analysis.compound.hyphenation.PatternParser.parse(PatternParser.java:123)
              [junit]     at org.apache.lucene.analysis.compound.hyphenation.HyphenationTree.loadPatterns(HyphenationTree.java:138)
              [junit]     at org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter.getHyphenationTree(HyphenationCompoundWordTokenFilter.java:142)
              [junit]     at org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter.testHyphenationCompoundWordsDE(TestCompoundWordTokenFilter.java:70)
              [junit]
              [junit]
              [junit] Testcase: testHyphenationCompoundWordsDELongestMatch(org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter):      Caused an ERROR
              [junit] File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
              [junit] org.apache.lucene.analysis.compound.hyphenation.HyphenationException: File not found: /home/thomas/projects/lucene-trunk-compound/hyphenation.dtd (No such file or directory)
              [junit]     at org.apache.lucene.analysis.compound.hyphenation.PatternParser.parse(PatternParser.java:123)
              [junit]     at org.apache.lucene.analysis.compound.hyphenation.HyphenationTree.loadPatterns(HyphenationTree.java:138)
              [junit]     at org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter.getHyphenationTree(HyphenationCompoundWordTokenFilter.java:142)
              [junit]     at org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter.testHyphenationCompoundWordsDELongestMatch(TestCompoundWordTokenFilter.java:96)
              [junit]
              [junit]
              [junit] Test org.apache.lucene.analysis.compound.TestCompoundWordTokenFilter FAILED
          

          So it does not find the hyphenation.dtd. I have to investigate how I can make that DTD known to the parser without copying hyphenation.dtd to Lucene's base directory.
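          One common way to do that (a sketch, assuming the pattern files are read through a SAX parser; this is not the code that ended up in the patch) is to install an org.xml.sax.EntityResolver that serves the DTD from the classpath instead of letting the parser resolve it relative to the working directory:

              import java.io.InputStream;
              import org.xml.sax.EntityResolver;
              import org.xml.sax.InputSource;

              // Serves the hyphenation.dtd that ships next to this class on the
              // classpath whenever the parser asks for it; otherwise returns null
              // so the parser falls back to its default resolution.
              public class ClasspathDtdResolver implements EntityResolver {

                public InputSource resolveEntity(String publicId, String systemId) {
                  if (systemId != null && systemId.endsWith("hyphenation.dtd")) {
                    InputStream in =
                        ClasspathDtdResolver.class.getResourceAsStream("hyphenation.dtd");
                    if (in != null) {
                      return new InputSource(in);
                    }
                  }
                  return null;
                }
              }

          The resolver would then be registered on the XMLReader with setEntityResolver() before the grammar file is parsed.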

          Thomas Peuss added a comment -
          • Fixed the problem with the hyphenation.dtd file that was not found
          • Removed all @author tags
          • Added a note to all files I copied from the FOP project
          • Added package.html files (not very much in there - but credits for the FOP project)
          Grant Ingersoll added a comment -

          So, why would I ever want to use a "Dumb" compound filter? Any suggestions for a better name? No need for a patch, I can just make the change.

          Grant Ingersoll added a comment -

          This looks pretty good, Thomas. I think the last bit that would be good is to add to the package docs a start-to-finish example of using it, kind of like in the test case. You might want to explain a little bit about where to get the hyphenation files, etc. (if I am understanding them correctly).

          I think if we can finish that up, we can look to commit.

          The other interesting thing here, as an aside, is the Ternary Tree might be worth pulling up to a "util" package (no need to do so now, just thinking out loud), as it could be used for other interesting things. For instance, see http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html The version we have needs a little work, but I have been thinking about how it might be used to improve spelling, etc.
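          For readers unfamiliar with the structure, a minimal ternary search tree sketch (not the FOP TernaryTree class itself): each node branches three ways on a single character, which keeps dictionary and spelling-style lookups compact.

              public class TernarySearchTreeSketch {

                private static class Node {
                  char c;
                  boolean wordEnd;   // true if a word ends at this node
                  Node lo, eq, hi;   // less-than, equal, greater-than branches
                  Node(char c) { this.c = c; }
                }

                private Node root;

                public void insert(String word) {
                  if (word.length() > 0) {
                    root = insert(root, word, 0);
                  }
                }

                private Node insert(Node node, String word, int i) {
                  char c = word.charAt(i);
                  if (node == null) {
                    node = new Node(c);
                  }
                  if (c < node.c) {
                    node.lo = insert(node.lo, word, i);
                  } else if (c > node.c) {
                    node.hi = insert(node.hi, word, i);
                  } else if (i + 1 < word.length()) {
                    node.eq = insert(node.eq, word, i + 1);
                  } else {
                    node.wordEnd = true;
                  }
                  return node;
                }

                public boolean contains(String word) {
                  if (word.length() == 0) {
                    return false;
                  }
                  Node node = root;
                  int i = 0;
                  while (node != null) {
                    char c = word.charAt(i);
                    if (c < node.c) {
                      node = node.lo;
                    } else if (c > node.c) {
                      node = node.hi;
                    } else if (i == word.length() - 1) {
                      return node.wordEnd;
                    } else {
                      node = node.eq;
                      i++;
                    }
                  }
                  return false;
                }
              }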

          Otis Gospodnetic added a comment -

          Ah, TST was in there, lovely! +1 to what Grant said about getting it into util later.
          Noticed a misspelling in javadoc while glancing at TST: hibrid -> hybrid

          Grant Ingersoll added a comment -

          Cool. It needs some work, IMO, to add more features per that article I sent, but no biggie.

          I saw some of those, but purposely left in the FOP typos... There were more than just that one.

          Thomas Peuss added a comment -

          So, why would I ever want to use a "Dumb" compound filter? Any suggestions for a better name? No need for a patch, I can just make the change.

          A better name would be DictionaryCompoundWordTokenFilter. I called it "Dumb" because it uses a brute-force approach. But DictionaryCompoundWordTokenFilter characterizes it better.

          Thomas Peuss added a comment -
          • Renamed DumbCompoundWordTokenFilter to DictionaryCompoundWordTokenFilter
          • Added more text to the package description file (package.html)
          • Removed some code that was necessary because of LUCENE-1163 (in HyphenationCompoundWordTokenFilter and DictionaryCompoundWordTokenFilter)
          Thomas Peuss added a comment -
          • Minor bugfix in DictionaryCompoundWordTokenFilter: it was not using the maxSubwordSize parameter
          • Major performance improvement for the DictionaryCompoundWordTokenFilter: we now convert all dictionary strings to lower case before adding them to the CharArraySet and set the ignoreCase parameter of CharArraySet to false. The filter makes a lower-case copy of the token before it starts working on it. This avoids many toLowerCase() calls in CharArraySet (see the sketch after this list).
          • Minor performance improvement for the HyphenationCompoundWordTokenFilter: see above
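          A small sketch of that "lower-case once" idea, with a plain HashSet standing in for CharArraySet (illustration only, not the patch code):

              import java.util.HashSet;
              import java.util.Set;

              public class LowerCaseDictionarySketch {

                private final Set<String> dictionary = new HashSet<String>();

                public LowerCaseDictionarySketch(String[] words) {
                  // Lower-case every dictionary entry exactly once, at construction time.
                  for (int i = 0; i < words.length; i++) {
                    dictionary.add(words[i].toLowerCase());
                  }
                }

                public boolean matchesSubword(String lowerCasedToken, int start, int end) {
                  // Callers lower-case the token once per token and reuse that copy,
                  // so no case conversion happens per candidate substring.
                  return dictionary.contains(lowerCasedToken.substring(start, end));
                }
              }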
          François Terrier added a comment -

          Is there any plan to integrate this patch into the official Lucene libraries in the short term?

          Grant Ingersoll added a comment -

          Yes.

          Grant Ingersoll added a comment - edited

          I'm now getting:
          ..../lucene/java/lucene-clean/contrib/analyzers/src/test/org/apache/lucene/analysis/compound/TestCompoundWordTokenFilter.java:60: warning: unmappable character for encoding utf-8
          [javac] "Aufgabe", "Überwachung" };

          Can you convert the classes in question to UTF-8 for the source?

          Thomas Peuss added a comment -

          UTF-8 problem fixed...

          Grant Ingersoll added a comment -

          Committed revision 657027.


            People

            • Assignee:
              Grant Ingersoll
            • Reporter:
              Thomas Peuss
            • Votes:
              0
            • Watchers:
              1
